
More effective / clean way to aggregate data

Python 3.7.10, pandas 1.1.5

Imagine we have a DataFrame with two categorical columns and a third column of numbers. The task is to group by the first category, then subgroup by the second category, and calculate totals and shares within each first-level group.

import pandas as pd

df = pd.DataFrame({
    'fruit': ['orange', 'orange', 'orange', 'banana', 'banana', 'banana'],
    'origin': ['USA', 'Canada', 'USA', 'Canada', 'USA', 'Canada'],
    'weight': [1, 2, 3, 4, 5, 6]
})
df
    fruit  origin  weight
0  orange     USA       1
1  orange  Canada       2
2  orange     USA       3
3  banana  Canada       4
4  banana     USA       5
5  banana  Canada       6
(df
 .groupby('fruit')
 .apply(lambda x: (x
                   .groupby('origin')
                   .agg({'weight': sum})
                   .assign(share=lambda x: x.weight / x.weight.sum()))
 )
)
               weight     share
fruit  origin
banana Canada      10  0.666667
       USA          5  0.333333
orange Canada       2  0.333333
       USA          4  0.666667

Is there a more pythonic / pandish / cleaner way to achieve the same result? For example, I can't rename weight on the fly when the aggregation is a count rather than a sum and I want the column name to reflect that.

In R it looks much cleaner to me.

library(dplyr)

df <- tibble(
  fruit = c('orange', 'orange', 'orange', 'banana', 'banana', 'banana'),
  origin = c('USA', 'Canada', 'USA', 'Canada', 'USA', 'Canada'),
  weight = c(1, 2, 3, 4, 5, 6)
)

df %>%
  group_by(fruit, origin) %>%
  summarise(total = sum(weight)) %>%
  mutate(share = total / sum(total))

I believe there is a cleaner way to do this in Python.

Roman Shevtsiv asked Dec 07 '22 09:12


1 Answer

You can use two separate groupby calls to make it cleaner:

In [101]: x = df.groupby(['fruit', 'origin']).sum().reset_index()
In [104]: x['share'] = x.groupby('fruit')['weight'].apply(lambda i: i/i.sum())

In [105]: x
Out[105]: 
    fruit  origin  weight     share
0  banana  Canada      10  0.666667
1  banana     USA       5  0.333333
2  orange  Canada       2  0.333333
3  orange     USA       4  0.666667

OR, as per @Manakin's comment, avoiding apply:

In [101]: x = df.groupby(['fruit', 'origin']).sum().reset_index()
In [109]: x['share'] = x['weight'].div(x.groupby('fruit')['weight'].transform('sum'))

In [110]: x
Out[110]: 
    fruit  origin  weight     share
0  banana  Canada      10  0.666667
1  banana     USA       5  0.333333
2  orange  Canada       2  0.333333
3  orange     USA       4  0.666667
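
To address the renaming part of the question, here is a minimal sketch that combines named aggregation with the transform approach above in a single chain, roughly mirroring the dplyr pipeline. It assumes a pandas version with named aggregation (0.25 or later); the names `total`, `n`, and `result` are illustrative and not from the original post.

```python
import pandas as pd

df = pd.DataFrame({
    'fruit': ['orange', 'orange', 'orange', 'banana', 'banana', 'banana'],
    'origin': ['USA', 'Canada', 'USA', 'Canada', 'USA', 'Canada'],
    'weight': [1, 2, 3, 4, 5, 6],
})

result = (
    df
    .groupby(['fruit', 'origin'])
    # Named aggregation: output_name=(input_column, function).
    # 'total' and 'n' are hypothetical names chosen for this example.
    .agg(total=('weight', 'sum'), n=('weight', 'count'))
    .reset_index()
    # Share of each (fruit, origin) total within its fruit.
    .assign(share=lambda d: d['total'] / d.groupby('fruit')['total'].transform('sum'))
)
print(result)
```

Because the output column name is chosen inside `agg`, switching from a sum to a count only means changing the tuple, and the name can change with it.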
Mayank Porwal answered Dec 30 '22 01:12