Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas - aggregate, sort and nlargest inside groupby

Tags:

python

pandas

I have following dataframe:

                       some_id
2016-12-26 11:03:10        001
2016-12-26 11:03:13        001
2016-12-26 12:03:13        001
2016-12-26 12:03:13        008
2016-12-27 11:03:10        009
2016-12-27 11:03:13        009
2016-12-27 12:03:13        003
2016-12-27 12:03:13        011

And i need to do something like transform('size') with following sort and get N max values. To get something like this (N=2):

             some_id   size
2016-12-26       001      3
                 008      1
2016-12-27       009      2
                 003      1

Is there elegant way to do that in pandas 0.19.x?

like image 801
Alex Zaitsev Avatar asked Dec 26 '16 16:12

Alex Zaitsev


2 Answers

Use value_counts to compute distinct counts after grouping on the date part of your DateTimeIndex. This sorts them in descending order by default.

You only need to take the topmost 2 rows of this result to get the largest (top-2) part.

fnc = lambda x: x.value_counts().head(2)
grp = df.groupby(df.index.date)['some_id'].apply(fnc).reset_index(1, name='size')
grp.rename(columns={'level_1':'some_id'})

enter image description here

like image 51
Nickil Maveli Avatar answered Sep 18 '22 21:09

Nickil Maveli


setup

from io import StringIO
import pandas as pd

txt = """                 some_id
2016-12-26 11:03:10        001
2016-12-26 11:03:13        001
2016-12-26 12:03:13        001
2016-12-26 12:03:13        008
2016-12-27 11:03:10        009
2016-12-27 11:03:13        009
2016-12-27 12:03:13        003
2016-12-27 12:03:13        011"""

df = pd.read_csv(StringIO(txt), sep='\s{2,}', engine='python')

df.index = pd.to_datetime(df.index)
df.some_id = df.some_id.astype(str).str.zfill(3)

df

                    some_id
2016-12-26 11:03:10     001
2016-12-26 11:03:13     001
2016-12-26 12:03:13     001
2016-12-26 12:03:13     008
2016-12-27 11:03:10     009
2016-12-27 11:03:13     009
2016-12-27 12:03:13     003
2016-12-27 12:03:13     011

using nlargest

df.groupby(pd.TimeGrouper('D')).some_id.value_counts() \
    .groupby(level=0, group_keys=False).nlargest(2)

            some_id
2016-12-26  001        3
            008        1
2016-12-27  009        2
            003        1
Name: some_id, dtype: int64
like image 23
piRSquared Avatar answered Sep 17 '22 21:09

piRSquared