Logging, Error Handling & Monitoring in Data Science Pipelines – Complete Guide 2026
In production data science, pipelines run 24/7, process terabytes of data, and power critical business decisions. When something goes wrong, you need to know exactly what happened, where it happened, and why. In 2026, professional data scientists treat logging, error handling, and monitoring as core skills — not afterthoughts. This article shows you how to build observable, debuggable, and resilient data pipelines using modern Python tools.
TL;DR — Key Practices 2026
- Replace every print() with structured logging
- Use the built-in logging module + JSON handlers
- Create custom exceptions for data-specific errors
- Log context (file name, row count, model version, environment)
- Integrate with monitoring platforms (Sentry, Prometheus, Datadog, Grafana)
- Always log at the right level: INFO, WARNING, ERROR, CRITICAL
1. Modern Logging Setup (2026 Best Practice)
import logging
from pathlib import Path
import json
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(levelname)s | %(name)s | %(message)s",
    handlers=[
        logging.FileHandler("pipeline.log"),
        logging.StreamHandler()
    ]
)
logger = logging.getLogger("data_pipeline")
logger.info("Starting daily ETL pipeline for file %s", Path("sales_20260320.csv"))
2. Structured Logging with JSON (Production Standard)
class JSONFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            # record.asctime only exists after a %-style formatter has
            # run, so build the timestamp explicitly:
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "module": record.module,
            "message": record.getMessage(),
            "extra": getattr(record, "extra", {})
        })
handler = logging.FileHandler("pipeline.jsonl")
handler.setFormatter(JSONFormatter())
logger.addHandler(handler)
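To see the structured output end to end, the sketch below attaches the same kind of JSON formatter to an in-memory stream instead of a file, so the resulting line can be inspected directly. The logger name and the sample fields (rows, file) are illustrative; note that "extra" is a safe key to pass because it does not clash with any built-in LogRecord attribute.

```python
import io
import json
import logging

class JSONFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            # self.formatTime() builds the timestamp; record.asctime
            # only exists after a %-style formatter has run.
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
            "extra": getattr(record, "extra", {}),
        })

# Log to an in-memory stream here so the output is easy to inspect;
# in production this would be a FileHandler writing pipeline.jsonl.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JSONFormatter())
demo_logger = logging.getLogger("jsonl_demo")
demo_logger.addHandler(handler)
demo_logger.setLevel(logging.INFO)

# Context travels under the "extra" key so the formatter can pick it up.
demo_logger.info("Loaded raw file",
                 extra={"extra": {"rows": 10000, "file": "sales.csv"}})

record = json.loads(stream.getvalue())
print(record["message"], record["extra"]["rows"])  # Loaded raw file 10000
```

Each line in pipeline.jsonl is then a standalone JSON object, which is exactly what log shippers and monitoring tools expect to parse.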
3. Custom Exceptions for Data Science
class DataValidationError(Exception):
    """Raised when data fails business validation rules."""
    pass

class SchemaMismatchError(Exception):
    """Raised when incoming data schema does not match expected schema."""
    pass

def validate_sales_data(df):
    if "customer_id" not in df.columns:
        raise SchemaMismatchError("Missing customer_id column")
    if df["amount"].min() < 0:
        raise DataValidationError("Negative amounts detected")
    logger.info("Data validation passed - %d rows", len(df))
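The value of separate exception types is that callers can react differently to a schema problem than to a value problem. The pandas-free sketch below mirrors the same checks over a plain dict of column lists (a stand-in for the DataFrame, so the example runs without pandas installed) and shows both exceptions being caught:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("validation_demo")

class DataValidationError(Exception):
    """Raised when data fails business validation rules."""

class SchemaMismatchError(Exception):
    """Raised when incoming data schema does not match expected schema."""

def validate_sales_data(data):
    # `data` is a dict of column name -> list of values, standing in
    # for the DataFrame used in the article.
    if "customer_id" not in data:
        raise SchemaMismatchError("Missing customer_id column")
    if min(data["amount"]) < 0:
        raise DataValidationError("Negative amounts detected")
    log.info("Data validation passed - %d rows", len(data["amount"]))

caught = []
for batch in (
    {"amount": [10.0, 25.5]},                         # missing column
    {"customer_id": [1, 2], "amount": [10.0, -3.0]},  # bad value
    {"customer_id": [1, 2], "amount": [10.0, 25.5]},  # clean
):
    try:
        validate_sales_data(batch)
        caught.append("ok")
    except SchemaMismatchError:
        caught.append("schema")
    except DataValidationError:
        caught.append("validation")

print(caught)  # ['schema', 'validation', 'ok']
```

A schema mismatch usually means the upstream producer changed and the run should stop; a validation failure may only need an alert and a quarantined batch.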
4. Real-World Pipeline with Full Error Handling
def run_daily_pipeline():
    try:
        logger.info("Pipeline started")
        df = load_raw_data()
        df = clean_and_validate(df)
        model = train_or_load_model()
        predictions = model.predict(df)
        save_results(predictions)
        logger.info("Pipeline completed successfully")
    except DataValidationError as e:
        logger.error("Validation failed: %s", e)
        notify_slack("Data validation error in daily pipeline")
    except Exception as e:
        logger.critical("Unexpected error in pipeline: %s", e, exc_info=True)
        raise
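The exc_info=True argument on the CRITICAL call is what captures the full traceback in the log record. The self-contained sketch below (with a hypothetical failing step and an in-memory stream instead of a file) demonstrates the effect:

```python
import io
import logging

stream = io.StringIO()
log = logging.getLogger("excinfo_demo")
log.addHandler(logging.StreamHandler(stream))
log.setLevel(logging.INFO)

def flaky_step():
    # Hypothetical pipeline step that fails mid-run.
    raise ValueError("upstream table is empty")

try:
    flaky_step()
except Exception as e:
    # exc_info=True appends the full traceback to the log entry,
    # which is what makes a CRITICAL record debuggable later.
    log.critical("Unexpected error in pipeline: %s", e, exc_info=True)

output = stream.getvalue()
print("Traceback" in output, "flaky_step" in output)  # True True
```

Without exc_info=True you would only see the exception message, not the file, line, and call chain that produced it.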
5. Monitoring & Alerting in 2026
Modern data teams integrate logging with:
- Sentry – for error tracking and stack traces
- Prometheus + Grafana – for pipeline metrics and dashboards
- Datadog – for end-to-end observability
- MLflow / Weights & Biases – for model monitoring
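Each of those platforms ships its own client library, but the bookkeeping a pipeline does locally is the same: count events and time stages, then export. As a library-agnostic sketch (all names here are illustrative, not any vendor's API), a minimal in-process metrics object might look like:

```python
import time
from collections import Counter
from contextlib import contextmanager

class PipelineMetrics:
    """Minimal in-process metrics, a stand-in for a real client
    such as prometheus_client; names are illustrative."""

    def __init__(self):
        self.counters = Counter()
        self.durations = {}

    def inc(self, name, amount=1):
        self.counters[name] += amount

    @contextmanager
    def timer(self, stage):
        # Record wall-clock duration of a pipeline stage.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.durations[stage] = time.perf_counter() - start

metrics = PipelineMetrics()

with metrics.timer("load"):
    rows = list(range(1000))              # stand-in for loading data
    metrics.inc("rows_processed", len(rows))

metrics.inc("pipeline_runs")
print(metrics.counters["rows_processed"], metrics.counters["pipeline_runs"])
```

In a real setup you would replace this class with the vendor client and scrape or push the same counters and timings to your dashboard.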
Best Practices in 2026
- Never use print() in production code; always use a logger
- Log at appropriate levels and include rich context
- Use structured (JSON) logs for easy parsing by monitoring tools
- Always catch and log exceptions with exc_info=True
- Set up alerts for ERROR and CRITICAL logs
- Include pipeline metadata (version, environment, git commit) in every log
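One clean way to attach metadata to every record is a logging.Filter that injects the fields before formatting, so no call site has to repeat them. A sketch (version, environment, and commit values are placeholders you would load from your build system):

```python
import io
import logging

class MetadataFilter(logging.Filter):
    """Injects pipeline metadata into every log record so the
    formatter can reference it; values here are placeholders."""

    def __init__(self, version, environment, git_commit):
        super().__init__()
        self.version = version
        self.environment = environment
        self.git_commit = git_commit

    def filter(self, record):
        record.version = self.version
        record.environment = self.environment
        record.git_commit = self.git_commit
        return True  # never drop the record, only enrich it

stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter(
    "%(levelname)s | v%(version)s | %(environment)s | %(git_commit)s | %(message)s"
))
log = logging.getLogger("metadata_demo")
log.addHandler(handler)
log.addFilter(MetadataFilter("1.4.0", "prod", "a1b2c3d"))
log.setLevel(logging.INFO)

log.info("Pipeline started")
line = stream.getvalue().strip()
print(line)  # INFO | v1.4.0 | prod | a1b2c3d | Pipeline started
```

Because the filter sits on the logger, every handler (file, JSON, console) sees the same enriched record, and the metadata never drifts between log destinations.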
Conclusion
In 2026, a data pipeline without proper logging, error handling, and monitoring is considered incomplete and unprofessional. These practices turn fragile scripts into reliable, observable production systems that your entire team can trust and debug quickly.
Next steps:
- Replace every print() statement in your current project with proper logging
- Add custom exceptions and structured JSON logging to your main pipeline
- Integrate one monitoring tool (Sentry or Grafana) this week