Generators are a great tool for working with large datasets that may not fit into memory all at once. They let you process data one item at a time, so only the current item needs to be held in memory rather than the entire dataset.
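As a minimal sketch of that idea (the file name data.txt is just a placeholder), here is a generator that yields one line of a text file at a time, so the whole file is never loaded into memory:

def read_lines(file_path):
    # The file object is itself lazy: each iteration reads only the next line
    with open(file_path) as f:
        for line in f:
            yield line.rstrip("\n")

# Only the current line is held in memory during this loop
for line in read_lines("data.txt"):  # "data.txt" is a hypothetical file
    print(line)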
One common use case for generators with large datasets is to read data from a file or database in chunks. For example, let's say you have a large dataset stored in a CSV file and you want to process it in chunks of 1000 rows at a time:
import csv

def read_csv(file_path, chunk_size=1000):
    with open(file_path) as f:
        reader = csv.reader(f)
        # skip header row
        next(reader)
        chunk = []
        for i, row in enumerate(reader):
            chunk.append(row)
            if (i + 1) % chunk_size == 0:
                yield chunk
                chunk = []
        if chunk:
            yield chunk
In this example, we define a generator function called read_csv that takes a file path and a chunk size (1000 rows by default). The function opens the file with a context manager, creates a csv.reader object, and skips the header row with the built-in next function. It then initializes an empty list called chunk and loops over each row from the reader. On each iteration it appends the row to chunk, and whenever the number of rows read so far is a multiple of chunk_size, it yields the chunk and resets it to an empty list. Finally, after the loop completes, it yields any remaining rows left in chunk.
To use the read_csv generator, we call it with the path of the CSV file we want to read and then iterate over the resulting generator object with a for loop. On each iteration we receive a chunk of up to 1000 rows, which we can process in whatever way we need.
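As a sketch of that usage (the file name large_dataset.csv and the per-chunk work shown here are assumed for illustration):

# "large_dataset.csv" is a hypothetical path; replace it with your own file
for chunk in read_csv("large_dataset.csv", chunk_size=1000):
    # each chunk is a list of up to 1000 rows, each row a list of strings
    print(f"processing {len(chunk)} rows")

Note that the file stays open only while the generator is being consumed: the with block inside read_csv does not exit until the generator is exhausted or closed.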
Generators like this let you process very large datasets without loading everything into memory at once. Instead, you work with the data one chunk at a time, which greatly reduces memory usage and lets you handle datasets far larger than would otherwise fit in memory.