
pandas group by, aggregate using multiple agg functions on input columns

Tags: python, pandas, r

I want to aggregate a grouped pandas DataFrame, applying several different custom functions across multiple columns. This is easy and routine in R (using data.table or dplyr), but I am finding it surprisingly difficult in pandas:

import pandas as pd
data = pd.DataFrame({'A':[1,2,3,4,5,6],'B':[2,4,6,8,10,12],'C':[1,1,1,2,2,2]})

#These work
data.groupby('C').apply(lambda x: x.A.mean() - x.B.mean())
data.groupby('C').agg(['mean','std'])

# but this doesn't: with a list of functions, agg calls each one per column, so x.A is not available
data.groupby('C').agg([lambda x: x.A.mean() - x.B.mean(),
                       lambda x: len(x.A)])

I want to calculate a statistic and also the sample size in each group, which seems like it should take one or two lines. I also sometimes need to apply multiple functions to multiple columns of the grouped data frame.

asked Feb 04 '23 10:02 by Allen Wang


2 Answers

If you need a one-liner, you can do this:

#use apply instead of agg to create multiple columns
data.groupby('C').apply(lambda x: pd.Series([x.A.mean() - x.B.mean(), len(x.A)])).rename(columns={0:'diff',1:'a_len'})
Out[2346]: 
   diff  a_len
C             
1  -2.0    3.0
2  -5.0    3.0

Another solution that avoids rename, at the cost of an extra index level in the result:

data.groupby('C').apply(lambda x: pd.DataFrame([[x.A.mean() - x.B.mean(), len(x.A)]],columns=['diff','a_len']))
Out[24]: 
     diff  a_len
C               
1 0  -2.0      3
2 0  -5.0      3
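When the statistic can be assembled from per-column aggregates, as the mean difference here can, a different technique may also fit: named aggregation with `agg`, followed by ordinary column arithmetic. This is a sketch under that assumption, not the approach used above, and the intermediate column names `a_mean` and `b_mean` are made up for illustration.

```python
import pandas as pd

data = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6],
                     'B': [2, 4, 6, 8, 10, 12],
                     'C': [1, 1, 1, 2, 2, 2]})

# Named aggregation: output_name=(input_column, aggregation_function).
stats = data.groupby('C').agg(a_mean=('A', 'mean'),
                              b_mean=('B', 'mean'),
                              a_len=('A', 'size'))

# Combine the per-column aggregates after grouping.
stats['diff'] = stats['a_mean'] - stats['b_mean']
result = stats[['diff', 'a_len']]
print(result)
```

This keeps the group sizes as integers and avoids the extra index level, but it only works when the cross-column statistic can be computed from the aggregated columns.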
answered Feb 07 '23 01:02 by Allen


We can write a function that applies the custom computations to multiple columns and returns the result as a data frame.

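>>> def meandiff_length(data):
...     # work on a copy so the group handed in by groupby is not modified
...     data = data.copy()
...     data['mean_diff'] = data.A.mean() - data.B.mean()
...     data['a_length'] = len(data.A)
...     return data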

We can group the data and apply the custom function to the groups separately.

>>> data.groupby('C', group_keys=False).apply(meandiff_length)
   A   B  C  mean_diff  a_length
0  1   2  1       -2.0         3
1  2   4  1       -2.0         3
2  3   6  1       -2.0         3
3  4   8  2       -5.0         3
4  5  10  2       -5.0         3
5  6  12  2       -5.0         3

This particular function returns the same value in every row of a group, so you may want to follow it with drop_duplicates. However, the approach is general and still works as the custom function becomes more complex.
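The drop_duplicates step could look roughly like the sketch below. It assumes a reasonably recent pandas; to stay independent of how `apply` treats the grouping column across versions, it passes only A and B to the function and joins C back on the original index. The names `broadcast` and `summary` are illustrative.

```python
import pandas as pd

data = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6],
                     'B': [2, 4, 6, 8, 10, 12],
                     'C': [1, 1, 1, 2, 2, 2]})

def meandiff_length(group):
    # Copy first so the group passed in by groupby is left untouched.
    group = group.copy()
    group['mean_diff'] = group.A.mean() - group.B.mean()
    group['a_length'] = len(group.A)
    return group

# group_keys=False keeps the original row index on the result.
per_row = data.groupby('C', group_keys=False)[['A', 'B']].apply(meandiff_length)

# Re-attach the grouping column by index alignment.
broadcast = data[['C']].join(per_row)

# Keep one row per group and index it by the group key.
summary = (broadcast[['C', 'mean_diff', 'a_length']]
           .drop_duplicates()
           .set_index('C'))
print(summary)
```

Note that drop_duplicates collapses rows only because every row in a group carries identical values in the selected columns; if the custom function produced row-specific values, the per-row frame would be the result to keep.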

answered Feb 07 '23 00:02 by spies006