Building Reusable Python Packages for Data Scientists 2026
Stop copying the same utility functions, feature engineering code, and validation logic across multiple projects. In 2026, professional data scientists build and maintain reusable Python packages that can be installed with a single uv add or pip install. This article shows you exactly how to create, structure, test, document, and publish production-grade Python packages tailored for data science work.
TL;DR — Modern Package Creation 2026
- Use pyproject.toml + uv (the new standard)
- Follow the src layout for clean imports
- Include type hints, comprehensive docstrings, and tests
- Automate with Ruff, Pyright, pytest, and GitHub Actions
- Publish to PyPI or your private index for team-wide reuse
1. Project Structure (2026 Standard)
my_data_utils/
├── pyproject.toml
├── README.md
├── LICENSE
├── src/
│   └── my_data_utils/
│       ├── __init__.py
│       ├── data_loader.py
│       ├── feature_engineering.py
│       └── validation.py
├── tests/
│   └── test_feature_engineering.py
└── examples/
    └── usage.ipynb
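As a rough sketch, the tree above can be scaffolded with a few lines of standard-library Python. The `scaffold` helper and the `LAYOUT` constant are hypothetical names introduced here for illustration, and the files it creates are empty placeholders (an empty `usage.ipynb` is not yet a valid notebook).

```python
import tempfile
from pathlib import Path

# Files from the tree above, relative to the project root.
LAYOUT: tuple[str, ...] = (
    "pyproject.toml",
    "README.md",
    "LICENSE",
    "src/my_data_utils/__init__.py",
    "src/my_data_utils/data_loader.py",
    "src/my_data_utils/feature_engineering.py",
    "src/my_data_utils/validation.py",
    "tests/test_feature_engineering.py",
    "examples/usage.ipynb",
)


def scaffold(root: Path, files: tuple[str, ...] = LAYOUT) -> list[Path]:
    """Create empty placeholder files under root; existing files are left untouched."""
    created: list[Path] = []
    for rel in files:
        path = root / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        if not path.exists():
            path.touch()
            created.append(path)
    return created


if __name__ == "__main__":
    # Demonstrate in a throwaway directory so nothing is written to the real project.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "my_data_utils"
        for path in scaffold(root):
            print(path.relative_to(root))
```

Because the importable package sits under src/, tests run against the installed package rather than whatever happens to be in the working directory, which is the main reason the src layout is recommended.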
2. Modern pyproject.toml Setup
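# Assumption: Hatchling as the build backend; any PEP 517 backend works here.
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[project]
name = "my-data-utils"
version = "1.2.0"
description = "Reusable utilities for data science pipelines"
requires-python = ">=3.11"
dependencies = [
    "polars>=1.0.0",
    "pydantic>=2.0.0",
    "pyarrow>=15.0.0",
]

# Standard dependency groups (PEP 735) replace the legacy tool.uv.dev-dependencies field.
[dependency-groups]
dev = ["pytest", "ruff", "pyright"]

[tool.ruff]
line-length = 100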
3. Real-World Example: Reusable Feature Engineering
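# src/my_data_utils/feature_engineering.py
import polars as pl
from pydantic import BaseModel


class FeatureConfig(BaseModel):
    """Column settings validated by Pydantic before any transformation runs."""

    target: str
    categorical_cols: list[str]  # not used by the steps below


def engineer_features(df: pl.DataFrame, config: FeatureConfig) -> pl.DataFrame:
    """Apply common feature engineering steps used across projects."""
    return df.with_columns([
        # Assumes an "amount" column; multiplying by 1.1 adds a flat 10%.
        (pl.col("amount") * 1.1).alias("taxed_amount"),
        # Natural log of the target; values must be positive to stay finite.
        pl.col(config.target).log().alias("log_target"),
    ])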
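The function above assumes that an `amount` column and the configured target column exist. A guard along the following lines could live in `validation.py` and fail early with a readable message. This is only a sketch: `missing_columns` and `require_columns` are hypothetical helpers, written against plain column names so they do not depend on Polars.

```python
from collections.abc import Iterable


def missing_columns(available: Iterable[str], required: Iterable[str]) -> list[str]:
    """Return the required column names that are not available, keeping their order."""
    present = set(available)
    return [name for name in required if name not in present]


def require_columns(available: Iterable[str], required: Iterable[str]) -> None:
    """Raise a ValueError that names every missing column."""
    missing = missing_columns(available, required)
    if missing:
        raise ValueError(f"missing required columns: {missing}")


if __name__ == "__main__":
    # With Polars this would be require_columns(df.columns, ["amount", config.target]).
    require_columns(["amount", "region"], ["amount"])  # passes silently
    print(missing_columns(["amount", "region"], ["amount", "price"]))  # ['price']
```

Reporting all missing columns at once, rather than stopping at the first, saves a round trip when an upstream schema changes.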
4. Testing, Linting & Automation
# tests/test_feature_engineering.py
import polars as pl

from my_data_utils.feature_engineering import FeatureConfig, engineer_features


def test_engineer_features():
    df = pl.DataFrame({"amount": [100, 200]})
    config = FeatureConfig(target="amount", categorical_cols=[])
    result = engineer_features(df, config)
    assert "taxed_amount" in result.columns
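For the automation side, the same checks that run in CI can be chained locally. The sketch below is one possible runner, not a prescribed setup: `run_checks` is a hypothetical helper, and the commands in `CHECKS` assume Ruff, Pyright, and pytest are installed in the active environment and invoked from the project root.

```python
import subprocess
from collections.abc import Sequence

# Assumed commands; adjust them to the tools and paths your project uses.
CHECKS: list[list[str]] = [
    ["ruff", "check", "."],
    ["pyright"],
    ["pytest"],
]


def run_checks(commands: Sequence[Sequence[str]]) -> int:
    """Run each command in order; return the first non-zero exit code, else 0."""
    for command in commands:
        print("$", " ".join(command))
        try:
            completed = subprocess.run(list(command), check=False)
        except FileNotFoundError:
            # The tool is not installed or not on PATH.
            print(f"command not found: {command[0]}")
            return 127
        if completed.returncode != 0:
            return completed.returncode
    return 0


# Usage from the project root:
#     raise SystemExit(run_checks(CHECKS))
```

Stopping at the first failure keeps feedback fast: linting finishes in seconds, so there is little reason to wait for the test suite when the lint step has already failed.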
5. Best Practices in 2026
- Use uv for dependency management and packaging
- Always include type hints and Pydantic models for configuration
- Write tests for every public function
- Automate linting, type checking, and testing with GitHub Actions
- Publish to PyPI or your company’s private index
- Keep packages small and focused (one responsibility per package)
Conclusion
Building reusable Python packages is one of the highest-leverage skills for data scientists in 2026. Instead of copying code between projects, you install your own package with one command and get consistent, tested, documented utilities everywhere. A bug gets fixed once rather than in every copy, less time goes to rewriting the same helpers, and collaboration with software engineers becomes easier because the code arrives in a form they already know how to install, test, and review.
Next steps:
- Create your first reusable package today using the structure above
- Move your most-used utility functions into it and install it across all projects
- Continue the “Software Engineering For Data Scientists” series to learn more production skills