Dask arrays provide a convenient way to perform parallel and out-of-core computations on large arrays that do not fit in memory. Here are some basic operations you can perform with Dask arrays for aggregating data:
```python
import dask.array as da

# Create a Dask array
x = da.random.normal(size=(10000, 10000), chunks=(1000, 1000))

# Compute the sum of the array
sum_x = x.sum()
print(sum_x.compute())

# Compute the mean of each row
mean_rows = x.mean(axis=1)
print(mean_rows.compute())

# Compute the maximum of each column
max_cols = x.max(axis=0)
print(max_cols.compute())
```
In this example, we first create a Dask array x of shape (10000, 10000) with chunk size (1000, 1000) using the da.random.normal function. We then use the sum method to compute the sum of all elements in the array. Calling sum only builds a lazy task graph; the compute method actually performs the computation and returns the result.
We can also compute aggregate statistics on subsets of the array by specifying the axis parameter to the mean and max methods. For example, we compute the mean of each row of the array by setting axis=1 for the mean method, and we compute the maximum of each column of the array by setting axis=0 for the max method.
Note that all of these computations are performed in parallel, with each chunk of the array processed independently, potentially on different threads or workers.
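As a minimal sketch of this chunked, parallel execution (the array sizes here are just for illustration), you can inspect how an array is split into chunks and choose a scheduler explicitly when calling compute:

```python
import dask.array as da

x = da.random.normal(size=(4000, 4000), chunks=(1000, 1000))

# Each dimension is split into four chunks of length 1000
print(x.chunks)  # ((1000, 1000, 1000, 1000), (1000, 1000, 1000, 1000))

# Dask arrays default to a thread-based scheduler; the `scheduler`
# keyword lets you select it explicitly (e.g. "threads" or "processes").
total = x.sum().compute(scheduler="threads")
print(total)
```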
Dask arrays provide many other methods for aggregating data, including min, argmin, argmax, std, var, prod, and cumsum, among others.
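A quick sketch of a few of these reductions on a small array (the input values here are chosen only for illustration):

```python
import dask.array as da
import numpy as np

# A small 3x4 array with known values, split into (3, 2) chunks
x = da.from_array(np.arange(12).reshape(3, 4), chunks=(3, 2))

print(x.min().compute())            # smallest element: 0
print(x.argmax().compute())         # flat index of the maximum: 11
print(x.std().compute())            # standard deviation of all elements
print(x.prod(axis=0).compute())     # product down each column
print(x.cumsum(axis=1).compute())   # running sum along each row
```

As with sum, mean, and max, each of these builds a lazy graph and runs only when compute is called, and the axis parameter works the same way.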