 

Slicing a Dask Dataframe

I have the following code where I'd like to do a train/test split on a Dask dataframe

df = dd.read_csv(csv_filename, sep=',', encoding="latin-1",
                     names=cols, header=0, dtype='str') 

But when I try to do slices like

for train, test in cv.split(X, y):
    df.fit(X[train], y[train])

it fails with the error

KeyError: '[11639 11641 11642 ..., 34997 34998 34999] not in index'

Any ideas?

asked Jun 10 '17 by Zubair Ahmed

People also ask

How do I partition a Dask DataFrame?

The .partitions accessor allows partitionwise slicing of a Dask DataFrame. You can perform normal NumPy-style slicing, but rather than slicing elements of the array you slice along partitions, so, for example, df.partitions[:5] produces a new Dask DataFrame containing the first five partitions.

How do I index a Dask DataFrame?

Use the set_index syntax. Create a pandas DataFrame with two columns of data, and a 2-partition Dask DataFrame from it. Print the DataFrame and you will see it has one index column (created by default by pandas) and two columns of data. Take a look at the divisions of the Dask DataFrame: with two partitions, it has two divisions.

How do I select a row in a Dask DataFrame?

Just like pandas, Dask DataFrame supports label-based indexing with the .loc accessor for selecting rows or columns, and __getitem__ (square brackets) for selecting just columns. To select rows, the DataFrame's divisions must be known (see Internal Design and Dask DataFrames Best Practices for more information).

Is Dask faster than Numpy?

That's where Dask arrays provide much more flexibility than NumPy. They let you work with larger-than-memory objects, and computation can be significantly faster because chunks are processed in parallel.
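A minimal sketch of the chunked-computation idea, with an assumed array size and chunk size; the result matches plain NumPy, but each chunk can be processed in parallel:

```python
import numpy as np
import dask.array as da

# A 10,000-element array split into four chunks; operations are
# applied chunkwise and combined at the end.
x = da.arange(10_000, chunks=2_500)
total = (x ** 2).sum().compute()

# Same result as the eager NumPy computation.
assert total == (np.arange(10_000) ** 2).sum()
```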


1 Answer

Dask.dataframe doesn't support row-wise slicing. It does support the loc operation if you have a sensible index.

However, in your case of train/test splitting, you will probably be better served by the random_split method.

train, test = df.random_split([0.80, 0.20])

You could also make several splits and concatenate them in different ways, e.g. for manual cross-validation:

splits = df.random_split([0.20, 0.20, 0.20, 0.20, 0.20])

for i in range(5):
    trains = [splits[j] for j in range(5) if j != i]
    train = dd.concat(trains, axis=0)
    test = splits[i]
answered Oct 11 '22 by MRocklin