 

Slicing a Dask Dataframe

I have the following code where I'd like to do a train/test split on a Dask dataframe

df = dd.read_csv(csv_filename, sep=',', encoding="latin-1",
                     names=cols, header=0, dtype='str') 

But when I try to do slices like

for train, test in cv.split(X, y):
    df.fit(X[train], y[train])

it fails with the error

KeyError: '[11639 11641 11642 ..., 34997 34998 34999] not in index'

Any ideas?

asked Jun 10 '17 by Zubair Ahmed

People also ask

How do I partition a Dask DataFrame?

The .partitions accessor allows partitionwise slicing of a Dask DataFrame. You can perform normal NumPy-style slicing, but rather than slicing elements of the array you slice along partitions, so, for example, df.partitions[:5] produces a new Dask DataFrame containing the first five partitions.

How do I index a Dask DataFrame?

Use the set_index syntax. Create a pandas DataFrame with two columns of data, and a 2-partition Dask DataFrame from it. Print the DataFrame and you will see it has one index column (created by default by pandas) and two columns of data. Take a look at the divisions of the Dask DataFrame: with two partitions, it has two divisions.

How do I select a row in a Dask DataFrame?

Just like pandas, Dask DataFrame supports label-based indexing with the .loc accessor for selecting rows or columns, and __getitem__ (square brackets) for selecting just columns. To select rows, the DataFrame's divisions must be known (see Internal Design and Dask DataFrames Best Practices for more information).

Is Dask faster than Numpy?

That's where Dask arrays provide much more flexibility than NumPy. They let you work with larger-than-memory objects, and computation can be significantly faster because chunks are processed in parallel.
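A minimal sketch of the chunked-computation idea, with an assumed array size and chunk size; the result matches plain NumPy, but each chunk can be processed in parallel:

```python
import numpy as np
import dask.array as da

# A 10,000-element array split into four chunks; operations are
# applied chunkwise and combined at the end.
x = da.arange(10_000, chunks=2_500)
total = (x ** 2).sum().compute()

# Same result as the eager NumPy computation.
assert total == (np.arange(10_000) ** 2).sum()
```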


1 Answer

Dask.dataframe doesn't support row-wise slicing. It does support the loc operation if you have a sensible index.

However, in your case of train/test splitting, you will probably be better served by the random_split method.

train, test = df.random_split([0.80, 0.20])

You could also make several splits and concatenate them in different ways, e.g. for manual cross-validation:

splits = df.random_split([0.20, 0.20, 0.20, 0.20, 0.20])

for i in range(5):
    trains = [splits[j] for j in range(5) if j != i]
    train = dd.concat(trains, axis=0)
    test = splits[i]
answered Oct 11 '22 by MRocklin