Using pandas read_csv iterator for Streaming Large Data – Best Practices 2026
The chunksize parameter makes pd.read_csv() return a TextFileReader iterator instead of a single DataFrame. This is the most common and effective way to stream and process very large CSV files without loading the entire dataset into memory.
TL;DR — Core Pattern
- Use pd.read_csv(..., chunksize=N)
- Each iteration yields a normal pandas DataFrame chunk
- Process, enrich, or save each chunk independently
1. Basic Streaming with chunksize
```python
import pandas as pd

chunk_size = 100_000

for i, chunk in enumerate(pd.read_csv("large_sales_data.csv",
                                      chunksize=chunk_size,
                                      parse_dates=["order_date"],
                                      dtype={"customer_id": "int32",
                                             "amount": "float32"})):
    print(f"Processing chunk {i} with {len(chunk):,} rows")
    # Perform any operations on this chunk
    chunk["profit"] = chunk["amount"] * 0.25
    chunk["sqrt_amount"] = chunk["amount"].apply(lambda x: round(x**0.5, 2) if x > 0 else 0)
    # Save or aggregate results
    # chunk.to_parquet(f"processed/chunk_{i}.parquet")
```
2. Real-World Streaming Pipeline
```python
import pandas as pd

total_sales = 0.0
region_totals = {}

for chunk in pd.read_csv("10GB_sales.csv", chunksize=50_000):
    # Update running aggregates
    total_sales += chunk["amount"].sum()
    # Group statistics per chunk
    chunk_group = chunk.groupby("region")["amount"].sum()
    for region, sales in chunk_group.items():
        region_totals[region] = region_totals.get(region, 0) + sales

print(f"Grand Total Sales: ${total_sales:,.2f}")
for region, total in sorted(region_totals.items(), key=lambda x: x[1], reverse=True):
    print(f"{region:12} : ${total:,.2f}")
```
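Running sums extend naturally to other incremental aggregates. For example, a streaming per-region mean can be built from a running sum and a running count folded in chunk by chunk. A minimal sketch, using a small in-memory CSV via io.StringIO to stand in for a large on-disk file (the data is illustrative):

```python
import io
import pandas as pd

# Small in-memory CSV standing in for a large on-disk file (illustrative data)
csv_data = io.StringIO(
    "region,amount\n"
    "north,10\n"
    "south,20\n"
    "north,30\n"
    "south,40\n"
    "west,5\n"
)

region_sums = {}
region_counts = {}

for chunk in pd.read_csv(csv_data, chunksize=2):
    # Aggregate each chunk, then fold the result into the running totals
    stats = chunk.groupby("region")["amount"].agg(["sum", "count"])
    for region, row in stats.iterrows():
        region_sums[region] = region_sums.get(region, 0.0) + row["sum"]
        region_counts[region] = region_counts.get(region, 0) + row["count"]

region_means = {r: float(region_sums[r] / region_counts[r]) for r in region_sums}
print(region_means)  # {'north': 20.0, 'south': 30.0, 'west': 5.0}
```

The same sum-plus-count pattern works for any aggregate that decomposes over chunks; aggregates that do not (such as an exact median) need a different approach.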
3. Best Practices in 2026
- Choose chunk size based on available RAM (typically 50k–200k rows)
- Always specify dtype and parse_dates to reduce memory usage
- Perform filtering and enrichment inside the loop
- Write processed chunks to Parquet or a database instead of keeping everything in memory
- Use low_memory=False only when necessary
- Monitor memory with tracemalloc during development
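The tracemalloc tip combines directly with the streaming loop: start tracing before iterating, then read the peak afterwards. A minimal sketch (the synthetic in-memory CSV and the 1,000-row chunk size are illustrative assumptions, not recommendations):

```python
import io
import tracemalloc
import pandas as pd

# Synthetic in-memory CSV standing in for a large on-disk file (illustrative)
rows = "".join(f"{i},{i * 1.5}\n" for i in range(10_000))
csv_data = io.StringIO("customer_id,amount\n" + rows)

tracemalloc.start()

total = 0.0
for chunk in pd.read_csv(csv_data, chunksize=1_000,
                         dtype={"customer_id": "int32"}):
    total += float(chunk["amount"].sum())

current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"total sales: {total:,.2f}")
print(f"peak traced memory: {peak / 1024:.0f} KiB")
```

Comparing the peak across different chunk sizes is a quick way to pick a value that fits your RAM budget.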
Conclusion
Using pd.read_csv() with chunksize is the standard way to stream large CSV files in data science in 2026. It turns a potentially crashing script into a stable, memory-efficient pipeline. Combine it with proper dtype specification, chunked processing, and incremental aggregation to handle files of almost any size on standard hardware.
Next steps:
- Take one of your large CSV files and rewrite the loading script using the chunksize iterator