Returning top n values for group/multiindex in Pandas

I have a df that contains daily product and volume data:

date        product     Volume
20160101    A           10
20160101    B           5
...
20160102    A           20
...
...
20160328    B           20
20160328    C           100
...
20160330    D           20

I've aggregated it by month with:

df['yearmonth'] = df.date.astype(str).str[:6]
grouped = df.groupby(['yearmonth','product'])['Volume'].sum()

which gives me a Series of the form:

yearmonth   product 
201601      A       100
            B       90
            C       90
            D       85
            E       180
            F       50
            ...
201602      A       200
            C       120
            F       220
            G       40
            I       50
            ...
201603      B       120
            C       110
            D       110
            ...

I want to return the top n products by volume within each month. For example, the top 3 values would return:

201601  A  100
        B   90
        C   90
        E   180
201602  A   200
        C   120
        F   220
201603  B   120
        C   110
        D   110

I found some answers that use pd.IndexSlice and select, but they seem to act on the index alone. I can't figure out how to sort the values within each individual group. Related questions:

  • Pandas report top-n in group and pivot (which is Wes's example in "Python for Data Analysis" too)
  • pandas multi index sort specific fields
  • pandas: slice a MultiIndex by range of secondary index
comedyDave asked Jan 18 '17 15:01


1 Answer

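You can use SeriesGroupBy.nlargest. The groupby prepends the group key as an extra index level, so reset_index(level=0, drop=True) removes the duplicated yearmonth level: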

print (grouped.groupby(level='yearmonth').nlargest(3).reset_index(level=0, drop=True))
yearmonth  product
201601     E          180
           A          100
           B           90
201602     F          220
           A          200
           C          120
201603     B          120
           C          110
           D          110
Name: Volume, dtype: int64
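
To make the snippet above reproducible, here is a minimal sketch that rebuilds the monthly totals from the rows visible in the question (the elided "..." rows are left out, so this is an illustration rather than the asker's full data). It also shows the keep="all" option of nlargest, which keeps tied values and appears to match the four rows the question expects for 201601. The variable names raw, top3 and with_ties are only illustrative.

```python
import pandas as pd

# Monthly totals copied from the visible rows of the question; elided rows are omitted.
index = pd.MultiIndex.from_tuples(
    [("201601", "A"), ("201601", "B"), ("201601", "C"),
     ("201601", "D"), ("201601", "E"), ("201601", "F"),
     ("201602", "A"), ("201602", "C"), ("201602", "F"),
     ("201602", "G"), ("201602", "I"),
     ("201603", "B"), ("201603", "C"), ("201603", "D")],
    names=["yearmonth", "product"],
)
grouped = pd.Series(
    [100, 90, 90, 85, 180, 50, 200, 120, 220, 40, 50, 120, 110, 110],
    index=index, name="Volume",
)

# nlargest on the grouped Series prepends the group key, so 'yearmonth' appears twice.
raw = grouped.groupby(level="yearmonth").nlargest(3)

# Drop the prepended level to get back a (yearmonth, product) index.
top3 = raw.reset_index(level=0, drop=True)

# keep="all" retains every row tied with the n-th largest value (B and C at 90 in 201601).
with_ties = (grouped.groupby(level="yearmonth")
                    .nlargest(3, keep="all")
                    .reset_index(level=0, drop=True))

print(top3)
print(with_ties)
```

With the default keep="first", a tie at the cut-off is resolved by order of appearance, which is why C is dropped from 201601 in the output above.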

You can also use to_datetime with to_period to convert the dates to a year-month period:

print (df)
        date product  Volume
0   20160101       A      10
1   20160101       B       5
2   20160101       C      10
3   20160101       D       5
4   20160102       A      20
5   20160102       A      10
6   20160102       B       5
7   20160102       C      10
8   20160102       D       5
9   20160328       A      20
10  20160328       C     100
11  20160328       B      20
12  20160328       D      20
13  20160330       D      20

grouped = df.groupby([pd.to_datetime(df.date, format='%Y%m%d').dt.to_period('M'),
                     'product'])['Volume'].sum()
print (grouped)
date     product
2016-01  A           40
         B           10
         C           20
         D           10
2016-03  A           20
         B           20
         C          100
         D           40
Name: Volume, dtype: int64

print (grouped.groupby(level='date').nlargest(3).reset_index(level=0, drop=True))
date     product
2016-01  A           40
         C           20
         B           10
2016-03  C          100
         D           40
         A           20
Name: Volume, dtype: int64
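
As an alternative to nlargest (this is not part of the approach above, just a sketch of another common pattern), the totals can be sorted once and the first n rows of each month taken with groupby(...).head(n). The sample frame below is rebuilt from the printed df; the names totals and top3 are illustrative.

```python
import pandas as pd

# Sample data rebuilt from the printed DataFrame above.
df = pd.DataFrame({
    "date": [20160101] * 4 + [20160102] * 5 + [20160328] * 4 + [20160330],
    "product": list("ABCD") + list("AABCD") + list("ACBD") + ["D"],
    "Volume": [10, 5, 10, 5, 20, 10, 5, 10, 5, 20, 100, 20, 20, 20],
})

# Convert the integer dates to monthly periods (cast to str first to be explicit).
month = pd.to_datetime(df["date"].astype(str), format="%Y%m%d").dt.to_period("M")

# Monthly total per product, flattened to columns: date, product, Volume.
totals = df.groupby([month, "product"])["Volume"].sum().reset_index()

# Sort by month ascending and volume descending, then keep the first 3 rows per month.
top3 = (totals.sort_values(["date", "Volume"], ascending=[True, False])
              .groupby("date")
              .head(3))

print(top3)
```

This returns a flat DataFrame rather than a MultiIndex Series; call set_index(["date", "product"])["Volume"] on the result if the original shape is needed.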
jezrael answered Oct 24 '22 14:10