Using Iterators to Load Large Files into Memory – Memory-Efficient Data Loading 2026
When working with large files (several GB or more), loading the entire file into memory at once can cause out-of-memory errors. In 2026, using iterators is the standard approach for processing large files efficiently in data science pipelines.
TL;DR — Recommended Iterator-Based Loading

- Use `open()` with a `for` loop for line-by-line processing
- Use `csv.DictReader` for structured CSV files
- Use Pandas `chunksize` for large structured data
- Use generators (`yield`) for custom processing logic
1. Basic Line-by-Line Iteration (Most Memory Efficient)
```python
# For very large text or log files
with open("large_log.txt", "r", encoding="utf-8") as f:
    for line in f:  # The file object is an iterator - it loads one line at a time
        line = line.strip()
        if line and "ERROR" in line:
            print(line[:200])  # Process only what you need
```
2. Using csv Module with Iterator
```python
import csv

with open("huge_dataset.csv", "r", encoding="utf-8") as f:
    reader = csv.DictReader(f)  # Returns an iterator over rows
    for row in reader:  # Processes one row at a time
        amount = float(row["amount"])
        if amount > 10000:
            print(f"High value transaction: {row['customer_id']} - ${amount:,.2f}")
```
3. Pandas chunksize – Best for Large Structured Files
```python
import pandas as pd

# Process a large CSV in chunks (each chunk is a DataFrame)
chunk_size = 100_000
for chunk in pd.read_csv("large_sales.csv", chunksize=chunk_size, parse_dates=["order_date"]):
    # Process each chunk independently
    chunk_summary = chunk.groupby("region")["amount"].agg(["sum", "mean", "count"]).round(2)
    print(f"Processed chunk with {len(chunk)} rows")
    # Save or aggregate results from this chunk
```
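A common follow-up question is how to turn the per-chunk summaries into one global result. Sums and counts can simply be added across chunks, while the mean must be re-derived at the end (averaging per-chunk means would weight chunks incorrectly). A minimal self-contained sketch, using a small in-memory CSV via `io.StringIO` to stand in for a large file on disk (the region/amount columns mirror the example above):

```python
import io
import pandas as pd

# Small in-memory CSV standing in for a large file on disk
csv_data = io.StringIO(
    "region,amount\n"
    "east,100\n"
    "west,200\n"
    "east,300\n"
    "west,400\n"
)

# Accumulate per-chunk aggregates, then combine at the end
partials = []
for chunk in pd.read_csv(csv_data, chunksize=2):
    partials.append(chunk.groupby("region")["amount"].agg(["sum", "count"]))

# Sums and counts add across chunks; the mean is derived afterwards
totals = pd.concat(partials).groupby(level=0).sum()
totals["mean"] = totals["sum"] / totals["count"]
print(totals)
```

Only the list of small per-region summary frames is kept in memory, never the full dataset.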
4. Custom Generator for Maximum Control
```python
def read_large_file(file_path, batch_size=10000):
    """Generator that yields batches of processed rows."""
    with open(file_path, "r", encoding="utf-8") as f:
        batch = []
        for line in f:
            batch.append(line.strip())
            if len(batch) >= batch_size:
                yield batch  # Yield the batch and pause
                batch = []
        if batch:
            yield batch  # Yield any remaining lines

# Usage
for batch in read_large_file("huge_file.txt"):
    # Process batch
    processed = [line.upper() for line in batch if line]
    print(f"Processed batch of {len(processed)} lines")
```
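Generators also compose: you can chain them into a lazy pipeline where each stage handles one item at a time, so memory stays flat no matter how many stages you add. A small self-contained sketch (the stage names `lines` and `errors_only` are illustrative, not part of a library):

```python
def lines(source):
    """Stage 1: strip whitespace and drop blank lines."""
    for raw in source:
        line = raw.strip()
        if line:
            yield line

def errors_only(rows):
    """Stage 2: keep only lines containing ERROR."""
    for line in rows:
        if "ERROR" in line:
            yield line

# Any iterable works as a source - here a list stands in for an open file
log = ["INFO ok\n", "\n", "ERROR disk full\n", "ERROR timeout\n"]
found = list(errors_only(lines(log)))
print(found)  # → ['ERROR disk full', 'ERROR timeout']
```

Because each stage is a generator, swapping the list for an open file handle turns this into a streaming log filter with no other changes.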
Best Practices in 2026

- Use **iterators** (line-by-line or chunked) for any file larger than available RAM
- Prefer `csv.DictReader` over manual splitting for CSV files
- Use Pandas `chunksize` when you need DataFrame operations on large files
- Write custom generators when you need full control over processing logic
- Always use the `with` statement to ensure files are properly closed
Conclusion
Using iterators to load large files is a critical skill for data scientists working with real-world datasets. In 2026, the best practice is to never load massive files entirely into memory. Instead, process them iteratively using file iterators, `csv.DictReader`, Pandas chunks, or custom generators. This approach keeps memory usage low and allows you to handle files of almost any size.
Next steps:
- Try processing a large CSV or log file using one of the iterator patterns shown above