I have the following code where I'd like to do a train/test split on a Dask DataFrame:
import dask.dataframe as dd

df = dd.read_csv(csv_filename, sep=',', encoding="latin-1",
                 names=cols, header=0, dtype='str')
But when I try to take row slices like
for train, test in cv.split(X, y):
    df.fit(X[train], y[train])
it fails with the error
KeyError: '[11639 11641 11642 ..., 34997 34998 34999] not in index'
Any ideas?
Some background on how slicing works in Dask: the .partitions accessor allows partition-wise slicing of a Dask DataFrame. You can perform normal NumPy-style slicing, but rather than slicing elements you slice along partitions, so, for example, df.partitions[:5] produces a new Dask DataFrame made of the first five partitions.
Just like pandas, Dask DataFrame supports label-based indexing with the .loc accessor for selecting rows or columns, and __getitem__ (square brackets) for selecting just columns. To select rows, the DataFrame's divisions must be known (see the Dask documentation on internal design and DataFrame best practices). Divisions record the index boundaries of each partition and are typically established by calling set_index on a meaningful column; this partitioned layout is what lets Dask work with larger-than-memory data while running computations in parallel.
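As a small, self-contained sketch of partition-wise slicing (the toy data below is invented purely for illustration):

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({'id': range(10), 'value': range(10, 20)})
ddf = dd.from_pandas(pdf, npartitions=4)

first_two = ddf.partitions[:2]   # a new Dask DataFrame holding the first two partitions
print(first_two.npartitions)     # 2
print(ddf.divisions)             # index boundaries of each partition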
Dask.dataframe doesn't support row-wise slicing, which is why indexing with the integer positions produced by cv.split fails with the KeyError above. It does support the .loc operation if you have a sensible index.
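For example, assuming the CSV has an 'id' column that can serve as an index (the column name and its numeric values are assumptions for illustration, not something stated in the question):

df['id'] = df['id'].astype(int)   # the question reads every column as str
df = df.set_index('id')           # sort by 'id' and record the divisions
rows = df.loc[100:200]            # label-based row selection now works
print(rows.head())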
However, in your case of train/test splitting you will probably be better served by the random_split method:
train, test = df.random_split([0.80, 0.20])
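If the split needs to be reproducible across runs, random_split also accepts a random_state argument:

train, test = df.random_split([0.80, 0.20], random_state=42)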
You could also make several splits and concatenate them in different ways, for example to build five cross-validation folds:
splits = df.random_split([0.20, 0.20, 0.20, 0.20, 0.20])
for i in range(5):
    trains = [splits[j] for j in range(5) if j != i]
    train = dd.concat(trains, axis=0)   # the other four splits form the training set
    test = splits[i]                    # the held-out split is the test set
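As a rough sketch of how each fold might then be consumed, the splits below are materialized as pandas DataFrames and passed to an in-memory scikit-learn estimator; the 'target' column name, the numeric features, and LogisticRegression are all assumptions for illustration, not part of the original question:

from sklearn.linear_model import LogisticRegression

for i in range(5):
    trains = [splits[j] for j in range(5) if j != i]
    train = dd.concat(trains, axis=0).compute()   # materialize the training fold as pandas
    test = splits[i].compute()                    # materialize the held-out fold

    clf = LogisticRegression()                    # hypothetical in-memory estimator
    clf.fit(train.drop(columns='target'), train['target'])
    print('fold', i, 'score:', clf.score(test.drop(columns='target'), test['target']))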