I have a DataFrame that looks something like this:
A B C D
1 10 22 14
1 12 20 37
1 11 8 18
1 10 10 6
2 11 13 4
2 12 10 12
3 14 0 5
and a function that looks something like this (NOTE: it's actually doing something more complex that can't be easily separated into three independent calls, but I'm simplifying for clarity):
def myfunc(g):
return min(g), mean(g), max(g)
I want to use groupby on A with myfunc to get an output on columns B and C (ignoring D) something like this:
B C
min mean max min mean max
A
1 10 10.75 12 8 15.0 22
2 11 11.50 12 10 11.5 13
3 14 14.00 14 0 0.0 0
I can do the following:
df2.groupby('A')[['B','C']].agg(
{
'min': lambda g: myfunc(g)[0],
'mean': lambda g: myfunc(g)[1],
'max': lambda g: myfunc(g)[2]
})
But then—aside from this being ugly and calling myfunc multiple times—I end up with
max mean min
B C B C B C
A
1 12 22 10.75 15.0 10 8
2 12 13 11.50 11.5 11 10
3 14 0 14.00 0.0 14 0
I can use .swaplevel(axis=1) to swap the column levels, but even then B and C are in multiple duplicated columns, and with the multiple function calls it feels like barking up the wrong tree.
If you arrange for myfunc to return a DataFrame whose columns are ['A','B','C','D'] and whose rows index are ['min', 'mean', 'max'], then you could use groupby/apply to call the function (once for each group) and concatenate the results as desired:
import numpy as np
import pandas as pd
def myfunc(g):
result = pd.DataFrame({'min':np.min(g),
'mean':np.mean(g),
'max':np.max(g)}).T
return result
df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 3],
'B': [10, 12, 11, 10, 11, 12, 14],
'C': [22, 20, 8, 10, 13, 10, 0],
'D': [14, 37, 18, 6, 4, 12, 5]})
result = df.groupby('A')[['B','C']].apply(myfunc)
result = result.unstack(level=-1)
print(result)
prints
B C
max mean min max mean min
A
1 12.0 10.75 10.0 22.0 15.0 8.0
2 12.0 11.50 11.0 13.0 11.5 10.0
3 14.0 14.00 14.0 0.0 0.0 0.0
For others who may run across this and who do not need a custom function, note
that it behooves you to always use builtin aggregators (below, specified by the
strings 'min', 'mean' and 'max') if possible. They perform better than
custom Python functions. Happily, in this toy problem, it produces the desired result:
In [99]: df.groupby('A')[['B','C']].agg(['min','mean','max'])
Out[99]:
B C
min mean max min mean max
A
1 10 10.75 12 8 15.0 22
2 11 11.50 12 10 11.5 13
3 14 14.00 14 0 0.0 0
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With