I'm grouping a dataframe by multiple columns and aggregating to obtain multiple statistics. How to obtain a totally flat structure with each possible combination of group-keys enumerated as rows and each statistic present as columns?
import numpy as np
import pandas as pd
cities = ['Berlin', 'Oslo']
days = ['Monday', 'Friday']
data = pd.DataFrame({
'city': np.random.choice(cities, 12),
'day': np.random.choice(days, 12),
'people': np.random.normal(loc=10, size=12),
'cats': np.random.normal(loc=6, size=12)})
grouped = data.groupby(['city', 'day']).agg([np.mean, np.std])
This way I'm getting:
cats people
mean std mean std
city day
Berlin Friday 6.146924 0.721263 10.445606 0.730992
Monday 5.239267 NaN 9.022811 NaN
Oslo Friday 6.322276 0.866899 11.579813 0.114341
Monday 5.028919 0.815674 10.458439 1.182689
I need to get it flat:
city day cats_mean cats_std people_mean people_std
Berlin Friday 6.146924 0.721263 10.445606 0.730992
Berlin Monday 5.239267 NaN 9.022811 NaN
Oslo Friday 6.322276 0.866899 11.579813 0.114341
Oslo Monday 5.028919 0.815674 10.458439 1.182689
In [36]: grouped.columns = grouped.columns.map('_'.join)
In [37]: grouped = grouped.reset_index()
In [38]: grouped
Out[38]:
city day cats_mean cats_std people_mean people_std
0 Berlin Friday 5.852991 1.085163 11.078541 0.839688
1 Berlin Monday 6.978343 0.630983 9.876106 1.846204
2 Oslo Friday 6.096773 1.278176 9.710216 0.691672
Update using pd.NamedAgg
import numpy as np
import pandas as pd
cities = ['Berlin', 'Oslo']
days = ['Monday', 'Friday']
data = pd.DataFrame({
'city': np.random.choice(cities, 12),
'day': np.random.choice(days, 12),
'people': np.random.normal(loc=10, size=12),
'cats': np.random.normal(loc=6, size=12)})
grouped = data.groupby(['city', 'day']).agg(cats_mean=('cats', np.mean),
cats_std=('cats', np.std),
people_mean=('people', np.mean),
people_std=('people', np.std))
grouped.reset_index()
Nest renaming is deprecated
You can use a dictionary in .agg to rename your columns then drop column level and reset_index(): See this SO Post
Output:
city day people_mean people_std cats_std cats_mean
0 Berlin Friday 9.645190 0.699684 0.973866 6.478510
1 Berlin Monday 9.556898 0.126810 0.336654 6.624288
2 Oslo Friday 11.593491 NaN NaN 6.206595
3 Oslo Monday 10.202183 1.058651 0.657939 6.019748
You can perform a list comprehension on the column levels and join with an underscore and then call reset_index
:
In [39]:
grouped.columns= ['_'.join(x) for x in list(zip(grouped.columns.get_level_values(0), grouped.columns.get_level_values(1)))]
grouped = grouped.reset_index()
grouped
Out[39]:
city day cats_mean cats_std people_mean people_std
0 Berlin Friday 6.140710 0.555981 10.187634 0.359724
1 Berlin Monday 6.420175 0.986568 10.134376 0.963938
2 Oslo Friday 6.978572 0.573297 11.345484 1.454762
3 Oslo Monday 4.594814 NaN 10.842988 NaN
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With