Using Generator Functions in Python – Practical Patterns for Data Science 2026
Generator functions (using yield) are one of the most powerful tools for writing memory-efficient and clean data processing code. Once you learn how to use them effectively, they become essential for handling large datasets, streaming data, and building reusable pipelines.
TL;DR — How to Use a Generator Function
- Call the function like a normal function
- It returns a generator object (lazy iterator)
- Consume it with a `for` loop, `sum()`, `next()`, or `list()`
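The TL;DR above can be sketched in a few lines (a toy generator for illustration):

```python
def numbers():
    # A generator function: calling it runs no body code,
    # it just returns a generator object
    yield 1
    yield 2
    yield 3

gen = numbers()         # lazy iterator; nothing has executed yet
first = next(gen)       # advances to the first yield -> 1
rest = list(gen)        # resumes where it left off -> [2, 3]
total = sum(numbers())  # a fresh generator object -> 6
```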
1. Basic Usage
```python
def high_value_sales(records):
    for row in records:
        if row["amount"] > 1500:
            yield row

# Usage
sales_data = [{"customer_id": 101, "amount": 2300}, ...]
for sale in high_value_sales(sales_data):
    print(f"High value sale by customer {sale['customer_id']}: ${sale['amount']}")
```
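For a one-off filter like this, the same lazy behavior is also available as a generator expression, no `def` required (sample records for illustration):

```python
sales_data = [
    {"customer_id": 101, "amount": 2300},
    {"customer_id": 102, "amount": 900},
    {"customer_id": 103, "amount": 1800},
]

# Lazy, just like the generator function version
high_value = (row for row in sales_data if row["amount"] > 1500)

ids = [row["customer_id"] for row in high_value]  # -> [101, 103]
```

A named generator function is still the better choice once the filter logic is reused or grows beyond one line.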
2. Real-World Data Science Patterns
```python
import pandas as pd

# Pattern 1: Chunked processing
def process_in_chunks(df, chunk_size=50000):
    for start in range(0, len(df), chunk_size):
        # .copy() so column assignments on the chunk don't trigger
        # pandas' SettingWithCopyWarning by writing to a view of df
        yield df.iloc[start:start + chunk_size].copy()

df = pd.DataFrame(sales_data)  # any DataFrame with an "amount" column works
for chunk in process_in_chunks(df):
    chunk["profit"] = chunk["amount"] * 0.25
    print(f"Processed {len(chunk)} rows")
```
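The chunking idea is not pandas-specific. A stdlib-only sketch (the helper name `iter_chunks` is mine) works on any iterable, including file handles and database cursors:

```python
import itertools

def iter_chunks(iterable, chunk_size):
    # Yield successive lists of up to chunk_size items from any iterable
    it = iter(iterable)
    while True:
        chunk = list(itertools.islice(it, chunk_size))
        if not chunk:
            return
        yield chunk

sizes = [len(c) for c in iter_chunks(range(10), 4)]  # -> [4, 4, 2]
```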
```python
import math

# Pattern 2: Feature engineering pipeline
def enrich_sales_data(df):
    for row in df.itertuples():
        profit = row.amount * 0.25
        yield {
            "customer_id": row.customer_id,
            "amount": row.amount,
            "profit": round(profit, 2),
            "category": "Premium" if profit > 500 else "Standard",
            "log_amount": round(math.log(row.amount), 2) if row.amount > 0 else 0,
        }

# Consume the generator
enriched = list(enrich_sales_data(df))  # Only if you need the full list
enriched_df = pd.DataFrame(enriched)
```
3. Advanced Consumption Patterns
```python
# Sum without loading everything into memory
total_profit = sum(row["profit"] for row in enrich_sales_data(df))

# Get first N items
import itertools
top_10 = list(itertools.islice(enrich_sales_data(df), 10))

# Manual control with next()
gen = enrich_sales_data(df)
first_record = next(gen)
```
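One caveat worth internalizing: a generator is single-use. Every `next()`, `sum()`, or `list()` call consumes items permanently, so a second pass yields nothing. A minimal demonstration with a toy generator:

```python
def countdown(n):
    # Yield n, n-1, ..., 1
    while n > 0:
        yield n
        n -= 1

gen = countdown(3)
first_pass = list(gen)       # -> [3, 2, 1]
second_pass = list(gen)      # -> [] — the generator is exhausted
fresh = list(countdown(3))   # call the function again for a new generator
```

If you need multiple passes over the same data, call the generator function again, or materialize the results once with `list()`.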
4. Best Practices in 2026
- Use generator functions for any processing that can be done row-by-row or chunk-by-chunk
- Keep generators stateless and pure when possible
- Use them inside other functions or pipelines
- Document what the generator yields
- Convert to list only when you really need random access or multiple passes
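Putting the documentation advice into practice: a return annotation of `Iterator[...]` plus a `Yields:` section in the docstring tells callers exactly what each item looks like. A sketch with a hypothetical `running_totals` helper:

```python
from typing import Iterator

def running_totals(amounts: list[float]) -> Iterator[float]:
    """Yield the cumulative total after each amount.

    Yields:
        float: the running sum of all amounts seen so far.
    """
    total = 0.0
    for amount in amounts:
        total += amount
        yield total

totals = list(running_totals([100.0, 50.0, 25.0]))  # -> [100.0, 150.0, 175.0]
```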
Conclusion
Generator functions are the professional way to handle large-scale data processing in Python. In 2026 data science projects, they allow you to write clean, reusable, and extremely memory-efficient code. Master the pattern of building and consuming generator functions, and you will be able to process datasets of almost any size without running out of memory.
Next steps:
- Convert one of your current data processing loops into a generator function and integrate it into your workflow