I would like to know if it is possible to get the number of unique items in a given column after a groupby aggregation with Dask. I don't see anything like this in the documentation. It is available on pandas DataFrames and is really useful. I've seen some issues related to this, but I am not sure whether it is implemented.
Can someone give me some hints about this?
To implement nunique in a Dask groupby you have to use a custom aggregate function (dd.Aggregation): nunique is not an algebraic aggregation, so each partition has to carry its set of seen values through the reduce step before the distinct count can be taken.
import pandas as pd
import dask.dataframe as dd

def chunk(s):
    '''
    The function applied to each
    individual partition (map).
    '''
    return s.apply(lambda x: list(set(x)))

def agg(s):
    '''
    The function which will aggregate
    the results from all the partitions (reduce).
    '''
    s = s._selected_obj
    return s.groupby(level=list(range(s.index.nlevels))).sum()

def finalize(s):
    '''
    The optional function that will be
    applied to the result of the agg function.
    '''
    return s.apply(lambda x: len(set(x)))

tunique = dd.Aggregation('tunique', chunk, agg, finalize)

df = pd.DataFrame({
    'col': [0, 0, 1, 1, 2, 3, 3] * 10,
    'g0': ['a', 'a', 'b', 'a', 'b', 'b', 'a'] * 10,
})
ddf = dd.from_pandas(df, npartitions=10)
res = ddf.groupby(['col']).agg({'g0': tunique}).compute()
print(res)
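For the sample frame above the result is deterministic (col 0 sees only 'a', col 1 sees 'a' and 'b', col 2 sees only 'b', col 3 sees 'a' and 'b'), so the print should show something like:

     g0
col
0     1
1     2
2     1
3     2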
To expand on this comment, you can use nunique on a SeriesGroupBy directly:
import pandas as pd
import dask.dataframe as dd

d = {'col1': [1, 2, 3, 4], 'col2': [5, 6, 7, 8]}
df = pd.DataFrame(data=d)
ddf = dd.from_pandas(df, npartitions=2)

# nunique is available directly on the grouped series
ddf.groupby(['col1']).col2.nunique().to_frame().compute()
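Since every col1 value in this toy frame appears exactly once, each group holds a single distinct col2 value, so the computed frame should be:

      col2
col1
1        1
2        1
3        1
4        1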
See https://github.com/dask/dask/issues/6280 for more discussion.