I have the following dataset:
location  category  percent
A         5         100.0
B         3         100.0
C         2         50.0
C         4         13.0
D         2         75.0
D         3         59.0
D         4         13.0
D         5         4.0
I'm trying to get the n largest rows (by percent) for each location group in the dataframe. For example, if I want the top 2 largest percentages for each group, the output should be:
location  category  percent
A         5         100.0
B         3         100.0
C         2         50.0
C         4         13.0
D         2         75.0
D         3         59.0
It looks like in pandas this is relatively straightforward using pandas.core.groupby.SeriesGroupBy.nlargest, but dask doesn't have an nlargest function for groupby. I've been playing around with apply but can't seem to get it to work properly:

df.groupby(['location']).apply(lambda x: x['percent'].nlargest(2)).compute()

But I just get the error:

ValueError: Wrong number of items passed 0, placement implies 8
The apply should work, but your syntax is a little off: select the percent column on the groupby, and pass meta so dask knows the name and dtype of the result:
In [11]: df
Out[11]:
Dask DataFrame Structure:
              Unnamed: 0 location category  percent
npartitions=1
                   int64   object    int64  float64
                     ...      ...      ...      ...
Dask Name: from-delayed, 3 tasks

In [12]: df.groupby("location")["percent"].apply(lambda x: x.nlargest(2), meta=('x', 'f8')).compute()
Out[12]:
location
A         0    100.0
B         1    100.0
C         2     50.0
          3     13.0
D         4     75.0
          5     59.0
Name: x, dtype: float64
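For reference, here's the same fix as a self-contained sketch, assuming the data is first built in pandas and converted with dd.from_pandas (npartitions=2 is an arbitrary choice); meta is set to ('percent', 'f8') instead of the placeholder ('x', 'f8') so the result keeps the column's name:

import pandas as pd
import dask.dataframe as dd

# Build the example frame in pandas, then convert to a dask dataframe.
pdf = pd.DataFrame({
    "location": ["A", "B", "C", "C", "D", "D", "D", "D"],
    "category": [5, 3, 2, 4, 2, 3, 4, 5],
    "percent": [100.0, 100.0, 50.0, 13.0, 75.0, 59.0, 13.0, 4.0],
})
df = dd.from_pandas(pdf, npartitions=2)

# meta tells dask what the lambda returns: a float64 series named "percent".
top2 = (
    df.groupby("location")["percent"]
      .apply(lambda x: x.nlargest(2), meta=("percent", "f8"))
      .compute()
)
print(top2)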
In pandas you'd have .nlargest and .rank as groupby methods, which would let you do this without the apply (a rank-based sketch follows the example below):
In [21]: df1
Out[21]:
  location  category  percent
0        A         5    100.0
1        B         3    100.0
2        C         2     50.0
3        C         4     13.0
4        D         2     75.0
5        D         3     59.0
6        D         4     13.0
7        D         5      4.0

In [22]: df1.groupby("location")["percent"].nlargest(2)
Out[22]:
location
A         0    100.0
B         1    100.0
C         2     50.0
          3     13.0
D         4     75.0
          5     59.0
Name: percent, dtype: float64
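Since .rank was mentioned above but not shown, here's a rank-based sketch in pandas; unlike nlargest it keeps the original rows (and all columns) intact. method="first" is an assumption about how ties should be broken (by position):

# Rank percent within each location, largest first, then keep ranks 1 and 2.
ranks = df1.groupby("location")["percent"].rank(method="first", ascending=False)
top2_rows = df1[ranks <= 2]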
The dask documentation notes:
Dask.dataframe covers a small but well-used portion of the pandas API.
This limitation is for two reasons:
- The pandas API is huge
- Some operations are genuinely hard to do in parallel (for example sort).
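If the groupby-apply turns out to be slow on many partitions, one possible workaround (a sketch, not an official dask recipe) is to collect per-partition candidates with map_partitions and finish the reduction in pandas; this is correct because each group's global top 2 is always contained in the union of the per-partition top 2s:

# Top-2 candidates per location within each partition; the meta dict
# describes the columns/dtypes of the frames the lambda returns.
candidates = df.map_partitions(
    lambda part: part.groupby("location")["percent"]
                     .nlargest(2)
                     .reset_index(level=0),
    meta={"location": "object", "percent": "f8"},
)

# Final reduction in pandas: top 2 per location across all candidates.
top2 = candidates.compute().groupby("location")["percent"].nlargest(2)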