This is a follow-up question to Subsetting Dask DataFrames. I wish to shuffle the data in a dask dataframe before sending it in batches to an ML algorithm.
The answer to that question was to do the following:
for part in df.repartition(npartitions=100).to_delayed():
    batch = part.compute()
However, even if I were to shuffle the contents of batch, I'm a bit worried that it might not be ideal. The data is a time series, so data points would be highly correlated within each partition.
What I would ideally like is something along the lines of:
rand_idx = np.random.choice(len(df), batch_size, replace=False)
batch = df.iloc[rand_idx, :]
which would work on pandas but not on dask. Any thoughts?
I tried the following:
len_df = len(df)
train_len = int(len_df*0.8)
idx = np.random.permutation(len_df)
train_idx = idx[:train_len]
test_idx = idx[train_len:]
train_df = df.loc[train_idx]
test_df = df.loc[test_idx]
However, if I then run train_df.loc[:5,:].compute()
it returns a 124451-row dataframe, so I am clearly using dask incorrectly.
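One possible contributor to the surprising row count, shown here in plain pandas only: .loc slices by label, not by position, so on an index that has been permuted, .loc[:5] returns every row up to the place where the label 5 happens to sit. Whether this is the whole explanation for the dask result above is an assumption; the small frame and the fixed permutation below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Small illustrative frame with the default 0..9 integer index.
pdf = pd.DataFrame({"value": np.arange(10) * 10})

# A fixed permutation of the labels (hypothetical, chosen by hand).
idx = np.array([7, 2, 9, 5, 0, 3, 8, 1, 6, 4])
shuffled = pdf.loc[idx]

# Label-based slice: everything from the start up to and including label 5.
by_label = shuffled.loc[:5]

# Position-based slice: the first five rows, whatever their labels are.
by_position = shuffled.iloc[:5]

print(by_label.index.tolist())
print(by_position.index.tolist())
```

The label-based slice depends on where the label lands after shuffling, so its length can be anything from one row to the whole frame.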
I recommend adding a column of random data to your dataframe and then using that to set the index:
df = df.map_partitions(add_random_column_to_pandas_dataframe, ...)
df = df.set_index('name-of-random-column')
I encountered the same issue recently and came up with a different approach using a dask array and shuffle_slice, which was introduced in this pull request.
It shuffles the whole sample:
import numpy as np
from dask.array.slicing import shuffle_slice
d_arr = df.to_dask_array(lengths=True)  # compute chunk lengths so the array can be indexed
df_len = len(df)
np.random.seed(42)
index = np.random.choice(df_len, df_len, replace=False)
d_arr = shuffle_slice(d_arr, index)
To convert the result back to a dask dataframe:
df = d_arr.to_dask_dataframe(df.columns)
For me, it works well for large data sets.