python 3.7.10 pandas 1.1.5
Imagine we have a DataFrame with two categorical columns and a third column of numbers. The task is to group by the first category, then subgroup by the second, and calculate totals and shares within each first-level group.
import pandas as pd
df = pd.DataFrame({
'fruit': ['orange', 'orange', 'orange', 'banana', 'banana', 'banana'],
'origin': ['USA', 'Canada', 'USA', 'Canada', 'USA', 'Canada'],
'weight': [1, 2, 3, 4, 5, 6]
})
df
| | fruit | origin | weight |
|---|---|---|---|
| 0 | orange | USA | 1 |
| 1 | orange | Canada | 2 |
| 2 | orange | USA | 3 |
| 3 | banana | Canada | 4 |
| 4 | banana | USA | 5 |
| 5 | banana | Canada | 6 |
(df
    .groupby('fruit')
    .apply(lambda x: (x
        .groupby('origin')
        .agg({'weight': sum})
        .assign(share=lambda x: x.weight / x.weight.sum()))
    )
)
| fruit | origin | weight | share |
|---|---|---|---|
| banana | Canada | 10 | 0.666667 |
| | USA | 5 | 0.333333 |
| orange | Canada | 2 | 0.333333 |
| | USA | 4 | 0.666667 |
Is there a more Pythonic / pandas-idiomatic / cleaner way to achieve the same result? For example, I can't rename weight on the fly when the aggregation is a count rather than a sum and I want the column name to reflect that.
In R it looks much cleaner to me:
library(dplyr)
df <- tibble(
fruit = c('orange', 'orange', 'orange', 'banana', 'banana', 'banana'),
origin = c('USA', 'Canada', 'USA', 'Canada', 'USA', 'Canada'),
weight = c(1, 2, 3, 4, 5, 6)
)
df %>%
group_by(fruit, origin) %>%
summarise(total = sum(weight)) %>%
mutate(share = total / sum(total))
I believe there is a cleaner way to do it in Python.
You can use two separate groupby statements to make it cleaner:
In [101]: x = df.groupby(['fruit', 'origin']).sum().reset_index()
In [104]: x['share'] = x.groupby('fruit')['weight'].apply(lambda i: i/i.sum())
In [105]: x
Out[105]:
fruit origin weight share
0 banana Canada 10 0.666667
1 banana USA 5 0.333333
2 orange Canada 2 0.333333
3 orange USA 4 0.666667
Or, as per @Manakin's comment, avoid apply and use transform instead:
In [101]: x = df.groupby(['fruit', 'origin']).sum().reset_index()
In [109]: x['share'] = x['weight'].div(x.groupby('fruit')['weight'].transform('sum'))
In [110]: x
Out[110]:
fruit origin weight share
0 banana Canada 10 0.666667
1 banana USA 5 0.333333
2 orange Canada 2 0.333333
3 orange USA 4 0.666667
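To address the renaming part of the question, here is a minimal sketch that combines the two-groupby idea above with pandas named aggregation (available since pandas 0.25, so it should work on 1.1.5) in a single method chain, close in shape to the dplyr version. The column name total is taken from the R snippet; the variable name result is just illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'fruit': ['orange', 'orange', 'orange', 'banana', 'banana', 'banana'],
    'origin': ['USA', 'Canada', 'USA', 'Canada', 'USA', 'Canada'],
    'weight': [1, 2, 3, 4, 5, 6],
})

result = (
    df
    .groupby(['fruit', 'origin'])
    # Named aggregation: output column name = (input column, function),
    # so the result is called "total" rather than "weight".
    .agg(total=('weight', 'sum'))
    .reset_index()
    # Share of each (fruit, origin) total within its fruit group;
    # transform('sum') broadcasts the per-fruit sum back to every row.
    .assign(share=lambda d: d['total'] / d.groupby('fruit')['total'].transform('sum'))
)
print(result)
```

If the aggregation were a count instead of a sum, the same pattern should apply with something like count=('weight', 'count') and the share computed from the count column.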