I have a DataFrame that looks something like this:
A B C D
1 10 22 14
1 12 20 37
1 11 8 18
1 10 10 6
2 11 13 4
2 12 10 12
3 14 0 5
and a function that looks something like this (NOTE: it's actually doing something more complex that can't be easily separated into three independent calls, but I'm simplifying for clarity):
def myfunc(g):
return min(g), mean(g), max(g)
I want to use groupby
on A
with myfunc
to get an output on columns B
and C
(ignoring D
) something like this:
B C
min mean max min mean max
A
1 10 10.75 12 8 15.0 22
2 11 11.50 12 10 11.5 13
3 14 14.00 14 0 0.0 0
I can do the following:
df2.groupby('A')[['B','C']].agg(
{
'min': lambda g: myfunc(g)[0],
'mean': lambda g: myfunc(g)[1],
'max': lambda g: myfunc(g)[2]
})
But then—aside from this being ugly and calling myfunc
multiple times—I end up with
max mean min
B C B C B C
A
1 12 22 10.75 15.0 10 8
2 12 13 11.50 11.5 11 10
3 14 0 14.00 0.0 14 0
I can use .swaplevel(axis=1)
to swap the column levels, but even then B
and C
are in multiple duplicated columns, and with the multiple function calls it feels like barking up the wrong tree.
If you arrange for myfunc
to return a DataFrame whose columns are ['A','B','C','D']
and whose rows index are ['min', 'mean', 'max']
, then you could use groupby/apply
to call the function (once for each group) and concatenate the results as desired:
import numpy as np
import pandas as pd
def myfunc(g):
result = pd.DataFrame({'min':np.min(g),
'mean':np.mean(g),
'max':np.max(g)}).T
return result
df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 3],
'B': [10, 12, 11, 10, 11, 12, 14],
'C': [22, 20, 8, 10, 13, 10, 0],
'D': [14, 37, 18, 6, 4, 12, 5]})
result = df.groupby('A')[['B','C']].apply(myfunc)
result = result.unstack(level=-1)
print(result)
prints
B C
max mean min max mean min
A
1 12.0 10.75 10.0 22.0 15.0 8.0
2 12.0 11.50 11.0 13.0 11.5 10.0
3 14.0 14.00 14.0 0.0 0.0 0.0
For others who may run across this and who do not need a custom function, note
that it behooves you to always use builtin aggregators (below, specified by the
strings 'min'
, 'mean'
and 'max'
) if possible. They perform better than
custom Python functions. Happily, in this toy problem, it produces the desired result:
In [99]: df.groupby('A')[['B','C']].agg(['min','mean','max'])
Out[99]:
B C
min mean max min mean max
A
1 10 10.75 12 8 15.0 22
2 11 11.50 12 10 11.5 13
3 14 14.00 14 0 0.0 0
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With