Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas groupby aggregate to new columns

I have a DataFrame that looks something like this:

A    B    C   D
1    10   22  14
1    12   20  37
1    11   8   18
1    10   10  6
2    11   13  4
2    12   10  12
3    14   0   5

and a function that looks something like this (NOTE: it's actually doing something more complex that can't be easily separated into three independent calls, but I'm simplifying for clarity):

def myfunc(g):
    return min(g), mean(g), max(g)

I want to use groupby on A with myfunc to get an output on columns B and C (ignoring D) something like this:

                B               C
   min  mean  max  min  mean  max
A
1  10  10.75  12     8  15.0   22
2  11  11.50  12    10  11.5   13
3  14  14.00  14     0   0.0    0

I can do the following:

df2.groupby('A')[['B','C']].agg(
    {
        'min':  lambda g: myfunc(g)[0],
        'mean': lambda g: myfunc(g)[1],
        'max':  lambda g: myfunc(g)[2]
    })

But then—aside from this being ugly and calling myfunc multiple times—I end up with

  max       mean       min
    B   C      B     C   B   C
A
1  12  22  10.75  15.0  10   8
2  12  13  11.50  11.5  11  10
3  14   0  14.00   0.0  14   0

I can use .swaplevel(axis=1) to swap the column levels, but even then B and C are in multiple duplicated columns, and with the multiple function calls it feels like barking up the wrong tree.

like image 855
Dan Avatar asked Jan 04 '23 10:01

Dan


1 Answers

If you arrange for myfunc to return a DataFrame whose columns are ['A','B','C','D'] and whose rows index are ['min', 'mean', 'max'], then you could use groupby/apply to call the function (once for each group) and concatenate the results as desired:

import numpy as np
import pandas as pd

def myfunc(g):
    result = pd.DataFrame({'min':np.min(g),
                           'mean':np.mean(g),
                           'max':np.max(g)}).T
    return result

df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 3],
 'B': [10, 12, 11, 10, 11, 12, 14],
 'C': [22, 20, 8, 10, 13, 10, 0],
 'D': [14, 37, 18, 6, 4, 12, 5]})

result = df.groupby('A')[['B','C']].apply(myfunc)
result = result.unstack(level=-1)
print(result)

prints

      B                  C            
    max   mean   min   max  mean   min
A                                     
1  12.0  10.75  10.0  22.0  15.0   8.0
2  12.0  11.50  11.0  13.0  11.5  10.0
3  14.0  14.00  14.0   0.0   0.0   0.0

For others who may run across this and who do not need a custom function, note that it behooves you to always use builtin aggregators (below, specified by the strings 'min', 'mean' and 'max') if possible. They perform better than custom Python functions. Happily, in this toy problem, it produces the desired result:

In [99]: df.groupby('A')[['B','C']].agg(['min','mean','max'])
Out[99]: 
    B              C          
  min   mean max min  mean max
A                             
1  10  10.75  12   8  15.0  22
2  11  11.50  12  10  11.5  13
3  14  14.00  14   0   0.0   0
like image 77
unutbu Avatar answered Jan 07 '23 00:01

unutbu