
Shuffling data in dask

Tags: python, dask

This is a follow-up to the question Subsetting Dask DataFrames. I want to shuffle the data in a dask dataframe before sending it in batches to an ML algorithm.

The answer in that question was to do the following:

for part in df.repartition(npartitions=100).to_delayed():
    batch = part.compute()

However, even if I were to shuffle the contents of each batch, I'm worried that this would not be ideal. The data is a time series, so data points within each partition would be highly correlated.

What I would ideally like is something along the lines of:

rand_idx = np.random.choice(len(df), batch_size, replace=False)
batch = df.iloc[rand_idx, :]

which would work on pandas but not dask. Any thoughts?

Edit 1: Potential Solution

I tried the following:

len_df = len(df)  # total number of rows
train_len = int(len_df * 0.8)
idx = np.random.permutation(len_df)
train_idx = idx[:train_len]
test_idx = idx[train_len:]
train_df = df.loc[train_idx]
test_df = df.loc[test_idx]

However, train_df.loc[:5,:].compute() returns a dataframe with 124451 rows, so I am clearly using dask incorrectly.

sachinruk asked Oct 20 '17 03:10


2 Answers

I recommend adding a column of random data to your dataframe and then using that to set the index:

df = df.map_partitions(add_random_column_to_pandas_dataframe, ...)
df = df.set_index('name-of-random-column')

MRocklin answered Sep 19 '22 10:09


I ran into the same issue recently and came up with a different approach that uses a dask array and shuffle_slice, which was introduced in this pull request.

It shuffles the whole sample:

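import numpy as np
from dask.array.slicing import shuffle_slice

# lengths=True computes the chunk sizes, which are needed for indexing
d_arr = df.to_dask_array(lengths=True)
df_len = len(df)
np.random.seed(42)  # fixed seed for a reproducible shuffle
# random permutation of all row positions
index = np.random.choice(df_len, df_len, replace=False)
d_arr = shuffle_slice(d_arr, index)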

To transform it back into a dask dataframe:

df = d_arr.to_dask_dataframe(df.columns)

For me, this works well for large data sets.

Manuel Guth answered Sep 20 '22 10:09