How to Build a Generator Function in Python – Step-by-Step Guide for Data Science 2026
Building your own generator functions with yield is one of the most valuable skills for handling large-scale data in Python. Unlike regular functions that return once, generator functions can pause and resume, producing values one at a time with minimal memory usage.
TL;DR — Core Rules
- Use `def` and `yield` instead of `return`
- The function automatically becomes a generator when it contains `yield`
- Call it like a normal function — it returns a generator object
1. Simple Generator Function
```python
def count_up_to(n):
    i = 1
    while i <= n:
        yield i
        i += 1

# Usage
for number in count_up_to(5):
    print(number)
```
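Calling a generator function does not run its body. It returns a generator object, which you can advance manually with `next()` to see the pause-and-resume behavior directly:

```python
def count_up_to(n):
    i = 1
    while i <= n:
        yield i
        i += 1

gen = count_up_to(3)   # no code has run yet; gen is a generator object
print(next(gen))       # runs until the first yield -> 1
print(next(gen))       # resumes where it paused -> 2
print(list(gen))       # consumes whatever remains -> [3]
# Calling next(gen) again would raise StopIteration
```

This is why a generator can only be iterated once: after the values are consumed, the object is exhausted and a fresh call to `count_up_to()` is needed.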
2. Real Data Science Generator Functions
```python
import math

import pandas as pd

# Example 1: Row-by-row processor
def process_sales_rows(df):
    for row in df.itertuples():
        profit = row.amount * 0.25
        category = "Premium" if profit > 500 else "Standard"
        yield {
            "customer_id": row.customer_id,
            "amount": row.amount,
            "profit": round(profit, 2),
            "category": category,
        }

# Example 2: Chunked file reader with enrichment
def read_and_enrich_large_csv(file_path, chunk_size=50000):
    for chunk in pd.read_csv(file_path, chunksize=chunk_size):
        chunk["profit"] = chunk["amount"] * 0.25
        chunk["log_amount"] = chunk["amount"].apply(
            lambda x: round(math.log(x), 2) if x > 0 else 0
        )
        yield chunk

# Usage
for enriched_chunk in read_and_enrich_large_csv("huge_sales.csv"):
    print(f"Processed chunk with {len(enriched_chunk)} rows")
    # Save or further process this chunk
```
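To see the row-by-row processor in action without a real sales file, you can feed it a small hypothetical DataFrame that has the `customer_id` and `amount` columns the generator expects (the sample values below are made up for illustration):

```python
import pandas as pd

def process_sales_rows(df):
    # Same generator as above: yields one enriched dict per row
    for row in df.itertuples():
        profit = row.amount * 0.25
        category = "Premium" if profit > 500 else "Standard"
        yield {
            "customer_id": row.customer_id,
            "amount": row.amount,
            "profit": round(profit, 2),
            "category": category,
        }

# Hypothetical sample data with the columns the generator expects
sales = pd.DataFrame({
    "customer_id": [101, 102],
    "amount": [3000, 400],
})

for record in process_sales_rows(sales):
    print(record)
# 3000 * 0.25 = 750 -> "Premium"; 400 * 0.25 = 100 -> "Standard"
```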
3. Advanced Generator with Multiple Yields & State
```python
def batch_processor(data, batch_size=100):
    batch = []
    for item in data:
        batch.append(item)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:          # don't drop the final partial batch
        yield batch

# Usage
for batch in batch_processor(large_dataset):
    print(f"Processing batch of {len(batch)} items")
```
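With a concrete input you can verify the batching behavior, including the trailing partial batch that the final `if batch:` check emits:

```python
def batch_processor(data, batch_size=100):
    # Same generator as above: accumulate items, yield full batches,
    # then yield any leftover partial batch at the end
    batch = []
    for item in data:
        batch.append(item)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

# 250 items in batches of 100 -> sizes 100, 100, 50
sizes = [len(b) for b in batch_processor(range(250), batch_size=100)]
print(sizes)  # [100, 100, 50]
```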
4. Best Practices for Building Generators in 2026
- Keep generator functions focused on one clear responsibility
- Use descriptive names and document what is being yielded
- Prefer `itertuples()` over `iterrows()` inside generators for speed
- Use `yield from` when delegating to another generator
- Test generators with `next()` and small inputs first
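As a minimal sketch of the delegation practice, `yield from` hands control to a sub-generator without an explicit `for` loop (reusing the `count_up_to` generator from earlier; `count_both` is a made-up name for illustration):

```python
def count_up_to(n):
    i = 1
    while i <= n:
        yield i
        i += 1

def count_both(a, b):
    # Delegate to two sub-generators instead of looping over each one
    yield from count_up_to(a)
    yield from count_up_to(b)

print(list(count_both(2, 3)))  # [1, 2, 1, 2, 3]
```

Besides being shorter than `for x in sub(): yield x`, `yield from` also forwards values sent into the generator and propagates return values, which matters once pipelines get more advanced.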
Conclusion
Building custom generator functions is a key skill for modern data science. In 2026, they allow you to process massive datasets, build reusable pipelines, and keep memory usage low. Start simple, practice the yield pattern, and gradually move your data processing code from full lists to powerful, lazy generators.
Next steps:
- Take one of your existing data processing scripts and rewrite the core loop as a custom generator function