What would be the equivalent of sort_values in pandas for a dask DataFrame ? I am trying to scale some Pandas code which has memory issues to use a dask DataFrame instead.
Would the equivalent be:
ddf.set_index([col1, col2], sorted=True)
?
My preferred method is to first call set_index on a single column in Dask, and then distribute pandas' sort_values across the partitions using map_partitions:
# Prepare data
import dask
import dask.dataframe as dd

data = dask.datasets.timeseries()

# Sort by 'name' and 'id': set_index gives a global sort on 'name',
# then each partition is sorted by ('name', 'id') locally.
# Sorting by an index level in sort_values requires pandas >= 0.23.
data = data.set_index('name')
data = data.map_partitions(lambda df: df.sort_values(['name', 'id']))
One possible gotcha is that a single index value must not be spread across multiple partitions. From what I have seen in practice, Dask does not seem to let that happen, but it would be good to have a more well-founded opinion on that.
edit: I have asked about this in "Dask dataframe: Can a single index be in multiple partitions?"
Sorting in parallel is hard. You have two options in dask.dataframe:
As of now, you can call set_index with a single-column index:
In [1]: import pandas as pd
In [2]: import dask.dataframe as dd
In [3]: df = pd.DataFrame({'x': [3, 2, 1], 'y': ['a', 'b', 'c']})
In [4]: ddf = dd.from_pandas(df, npartitions=2)
In [5]: ddf.set_index('x').compute()
Out[5]:
   y
x
1  c
2  b
3  a
Unfortunately dask.dataframe does not (as of November 2016) support multi-column indexes:
In [6]: ddf.set_index(['x', 'y']).compute()
NotImplementedError: Dask dataframe does not yet support multi-indexes.
You tried to index with this index: ['x', 'y']
Indexes must be single columns only.
Given how you phrased your question, I suspect this doesn't apply to you, but cases that seem to need sorting can often get by with the much cheaper nlargest:
In [7]: ddf.x.nlargest(2).compute()
Out[7]:
0    3
1    2
Name: x, dtype: int64
In [8]: ddf.nlargest(2, 'x').compute()
Out[8]:
   x  y
0  3  a
1  2  b