 

Cumulative sum sorted descending within a group. Pandas

Tags: python, pandas

I've faced a problem while applying sort_values() and cumsum() within a group.

I have a dataset (the original screenshot is not available; the code below recreates it).

Basically, I need to sort the values within each group in descending order, compute the cumulative sales, and then select the rows that make up 90% of sales within each region.

I have tried the following, but the last line doesn't work. It returns an error: Cannot access callable attribute 'sort_values' of 'SeriesGroupBy' objects, try using the 'apply' method

I've also tried apply.

import pandas as pd

df = pd.DataFrame({
    'id': ['id_1', 'id_2', 'id_3', 'id_4', 'id_5', 'id_6', 'id_7', 'id_8',
           'id_1', 'id_2', 'id_3', 'id_4', 'id_5', 'id_6', 'id_7', 'id_8'],
    'region': [1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2],
    'sales': [54, 34, 23, 56, 78, 98, 76, 34, 27, 89, 76, 54, 34, 45, 56, 54],
})
df['%'] = df['sales'] / df.groupby(df['region'])['sales'].transform('sum')
# This is the failing line: a SeriesGroupBy object does not provide sort_values()
df['cumul'] = df.groupby(df['region'])['sales'].sort_values(ascending=False).cumsum()

Thank you for any suggestions

Vero asked Dec 07 '25 15:12


2 Answers

You can definitely sort the dataframe first, then do groupby():

df.sort_values(['region','sales'], ascending=[True,False],inplace=True)

df['%']=df['sales']/df.groupby(df['region'])['sales'].transform('sum')

df['cummul'] = df.groupby('region')['%'].cumsum()

# filter
df[df['cummul'].le(0.9)]

Output:

      id  region  sales         %    cummul
5   id_6       1     98  0.216336  0.216336
4   id_5       1     78  0.172185  0.388521
6   id_7       1     76  0.167770  0.556291
3   id_4       1     56  0.123620  0.679912
0   id_1       1     54  0.119205  0.799117
1   id_2       1     34  0.075055  0.874172
9   id_2       2     89  0.204598  0.204598
10  id_3       2     76  0.174713  0.379310
14  id_7       2     56  0.128736  0.508046
11  id_4       2     54  0.124138  0.632184
15  id_8       2     54  0.124138  0.756322
13  id_6       2     45  0.103448  0.859770
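
Note that the `le(0.9)` filter stops just before the row that pushes the running share past 90%. The sketch below is a possible variation, not part of the original answer: it leaves `df` in its original row order by relying on index alignment when assigning the cumulative column, and it adds an inclusive filter that also keeps the row crossing the threshold. The column names `share` and `cum_share` and the inclusive rule itself are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [f'id_{i}' for i in range(1, 9)] * 2,
    'region': [1] * 8 + [2] * 8,
    'sales': [54, 34, 23, 56, 78, 98, 76, 34, 27, 89, 76, 54, 34, 45, 56, 54],
})

# Share of each row in its region's total sales.
df['share'] = df['sales'] / df.groupby('region')['sales'].transform('sum')

# Sort a copy, accumulate within each region, then assign back.
# Column assignment aligns on the index, so df keeps its original row order.
ranked = df.sort_values(['region', 'sales'], ascending=[True, False])
df['cum_share'] = ranked.groupby('region')['share'].cumsum()

# Strict filter, as in the answer: running share at or below 90%.
strict = df[df['cum_share'] <= 0.9]

# Assumed inclusive variant: keep a row when the share accumulated
# before it is still below 90%, so the crossing row is retained.
inclusive = df[(df['cum_share'] - df['share']) < 0.9]

print(strict.groupby('region').size())
print(inclusive.groupby('region').size())
```

Which of the two filters is appropriate depends on whether "90% of sales" should be read as "at most 90%" or "at least 90%"; the question does not say.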
Quang Hoang answered Dec 09 '25 06:12


First we use your logic to create the % column, but we multiply by 100 and round to whole numbers.

Then we sort by region and %, no need for groupby.

After we sort, we create the cumul column.

And finally we select those within the 90% range with query:

df['%'] = df['sales'].div(df.groupby('region')['sales'].transform('sum')).mul(100).round()
df = df.sort_values(['region', '%'], ascending=[True, False])
df['cumul'] = df.groupby('region')['%'].cumsum()

df.query('cumul.le(90)')

Output:

      id  region  sales     %  cumul
5   id_6       1     98  22.0   22.0
4   id_5       1     78  17.0   39.0
6   id_7       1     76  17.0   56.0
0   id_1       1     54  12.0   68.0
3   id_4       1     56  12.0   80.0
1   id_2       1     34   8.0   88.0
9   id_2       2     89  20.0   20.0
10  id_3       2     76  17.0   37.0
14  id_7       2     56  13.0   50.0
11  id_4       2     54  12.0   62.0
15  id_8       2     54  12.0   74.0
13  id_6       2     45  10.0   84.0
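
One side effect of rounding before sorting and accumulating is visible in the output above: id_1 (54) is listed ahead of id_4 (56) because both round to 12%, and the running total is a sum of rounded values. The following is a minimal sketch of an alternative, assuming it is acceptable to sort on `sales`, accumulate the exact percentages, and round only for display. The column name `pct` and the `rounded_totals` check are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [f'id_{i}' for i in range(1, 9)] * 2,
    'region': [1] * 8 + [2] * 8,
    'sales': [54, 34, 23, 56, 78, 98, 76, 34, 27, 89, 76, 54, 34, 45, 56, 54],
})

# Exact percentage of each row within its region (no rounding yet).
df['pct'] = df['sales'] / df.groupby('region')['sales'].transform('sum') * 100

# Sort on the raw sales so rows with equal rounded percentages stay ordered.
df = df.sort_values(['region', 'sales'], ascending=[True, False])
df['cumul'] = df.groupby('region')['pct'].cumsum()

# For comparison: whole-number percentages need not add up to 100 per region.
rounded_totals = df['pct'].round().groupby(df['region']).sum()
print(rounded_totals)

# Filter on the exact running total and round only for display.
selected = df[df['cumul'] <= 90].round({'pct': 1, 'cumul': 1})
print(selected)
```

With this data the whole-number percentages sum to 101 for region 1 and 98 for region 2, which is why a threshold applied to the rounded running total can drift from one applied to the exact values.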
Erfan answered Dec 09 '25 06:12


