Using Generator Functions in Python – Practical Patterns for Data Science 2026
Generator functions (using yield) are one of the most powerful tools for writing memory-efficient and clean data processing code. Once you learn how to use them effectively, they become essential for handling large datasets, streaming data, and building reusable pipelines.
TL;DR — How to Use a Generator Function
- Call the function like a normal function
- It returns a generator object (lazy iterator)
- Consume it with a `for` loop, `sum()`, `next()`, or `list()`
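The TL;DR above can be sketched in a few lines (a toy generator for illustration):

```python
def numbers():
    # A generator function: calling it runs no body code,
    # it just returns a generator object
    yield 1
    yield 2
    yield 3

gen = numbers()         # lazy iterator; nothing has executed yet
first = next(gen)       # advances to the first yield -> 1
rest = list(gen)        # resumes where it left off -> [2, 3]
total = sum(numbers())  # a fresh generator object -> 6
```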
1. Basic Usage
```python
def high_value_sales(records):
    for row in records:
        if row["amount"] > 1500:
            yield row

# Usage
sales_data = [{"customer_id": 101, "amount": 2300}, ...]
for sale in high_value_sales(sales_data):
    print(f"High value sale by customer {sale['customer_id']}: ${sale['amount']}")
```
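For a one-off filter like this, the same lazy behavior is also available as a generator expression, no `def` required (sample records for illustration):

```python
sales_data = [
    {"customer_id": 101, "amount": 2300},
    {"customer_id": 102, "amount": 900},
    {"customer_id": 103, "amount": 1800},
]

# Lazy, just like the generator function version
high_value = (row for row in sales_data if row["amount"] > 1500)

ids = [row["customer_id"] for row in high_value]  # -> [101, 103]
```

A named generator function is still the better choice once the filter logic is reused or grows beyond one line.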
2. Real-World Data Science Patterns
```python
import pandas as pd

# Pattern 1: Chunked processing
def process_in_chunks(df, chunk_size=50000):
    for start in range(0, len(df), chunk_size):
        # .copy() so column assignments on the chunk don't trigger
        # pandas' SettingWithCopyWarning by writing to a view of df
        yield df.iloc[start:start + chunk_size].copy()

df = pd.DataFrame(sales_data)  # any DataFrame with an "amount" column works
for chunk in process_in_chunks(df):
    chunk["profit"] = chunk["amount"] * 0.25
    print(f"Processed {len(chunk)} rows")
```
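The chunking idea is not pandas-specific. A stdlib-only sketch (the helper name `iter_chunks` is mine) works on any iterable, including file handles and database cursors:

```python
import itertools

def iter_chunks(iterable, chunk_size):
    # Yield successive lists of up to chunk_size items from any iterable
    it = iter(iterable)
    while True:
        chunk = list(itertools.islice(it, chunk_size))
        if not chunk:
            return
        yield chunk

sizes = [len(c) for c in iter_chunks(range(10), 4)]  # -> [4, 4, 2]
```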
```python
import math

# Pattern 2: Feature engineering pipeline
def enrich_sales_data(df):
    for row in df.itertuples():
        profit = row.amount * 0.25
        yield {
            "customer_id": row.customer_id,
            "amount": row.amount,
            "profit": round(profit, 2),
            "category": "Premium" if profit > 500 else "Standard",
            "log_amount": round(math.log(row.amount), 2) if row.amount > 0 else 0,
        }

# Consume the generator
enriched = list(enrich_sales_data(df))  # Only if you need the full list
enriched_df = pd.DataFrame(enriched)
```
3. Advanced Consumption Patterns
```python
# Sum without loading everything into memory
total_profit = sum(row["profit"] for row in enrich_sales_data(df))

# Get first N items
import itertools
top_10 = list(itertools.islice(enrich_sales_data(df), 10))

# Manual control with next()
gen = enrich_sales_data(df)
first_record = next(gen)
```
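One caveat worth internalizing: a generator is single-use. Every `next()`, `sum()`, or `list()` call consumes items permanently, so a second pass yields nothing. A minimal demonstration with a toy generator:

```python
def countdown(n):
    # Yield n, n-1, ..., 1
    while n > 0:
        yield n
        n -= 1

gen = countdown(3)
first_pass = list(gen)       # -> [3, 2, 1]
second_pass = list(gen)      # -> [] — the generator is exhausted
fresh = list(countdown(3))   # call the function again for a new generator
```

If you need multiple passes over the same data, call the generator function again, or materialize the results once with `list()`.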
4. Best Practices in 2026
- Use generator functions for any processing that can be done row-by-row or chunk-by-chunk
- Keep generators stateless and pure when possible
- Use them inside other functions or pipelines
- Document what the generator yields
- Convert to list only when you really need random access or multiple passes
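Putting the documentation advice into practice: a return annotation of `Iterator[...]` plus a `Yields:` section in the docstring tells callers exactly what each item looks like. A sketch with a hypothetical `running_totals` helper:

```python
from typing import Iterator

def running_totals(amounts: list[float]) -> Iterator[float]:
    """Yield the cumulative total after each amount.

    Yields:
        float: the running sum of all amounts seen so far.
    """
    total = 0.0
    for amount in amounts:
        total += amount
        yield total

totals = list(running_totals([100.0, 50.0, 25.0]))  # -> [100.0, 150.0, 175.0]
```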
Conclusion
Generator functions are the professional way to handle large-scale data processing in Python. In 2026 data science projects, they allow you to write clean, reusable, and extremely memory-efficient code. Master the pattern of building and consuming generator functions, and you will be able to process datasets of almost any size without running out of memory.
Next steps:
- Convert one of your current data processing loops into a generator function and integrate it into your workflow