Apply multiple functions to multiple groupby columns

Tags: python, pandas, group-by, aggregate-functions

The docs show how to apply multiple functions on a groupby object at a time using a dict with the output column names as the keys:

    In [563]: grouped['D'].agg({'result1' : np.sum,
       .....:                   'result2' : np.mean})
       .....:

    Out[563]:
          result2   result1
    A
    bar -0.579846 -1.739537
    foo -0.280588 -1.402938

However, this only works on a Series groupby object. When a dict is passed to a DataFrame groupby object in the same way, the keys are expected to be the names of the columns that the functions will be applied to.
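As a point of reference, on pandas 0.25 or later the Series-level renaming can be sketched with keyword arguments instead of a dict (the dict-renaming form shown above was removed in pandas 1.0). The frame and its values below are made up for illustration; only the `result1`/`result2` names come from the snippet above.

```python
import pandas as pd

# Hypothetical data standing in for the docs example
df = pd.DataFrame({
    'A': ['bar', 'bar', 'foo', 'foo'],
    'D': [1.0, 3.0, 2.0, 6.0],
})
grouped = df.groupby('A')

# Keyword name = output column, value = aggregation to apply to column D
result = grouped['D'].agg(result1='sum', result2='mean')
print(result)
```

This still only covers a single column at a time, which is the limitation the question is about.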

What I want to do is apply multiple functions to several columns (and certain columns will be operated on multiple times). Also, some functions will depend on other columns in the groupby object (like sumif functions). My current solution is to go column by column, doing something like the code above and using lambdas for functions that depend on other rows. But this is slow (I think it takes a long time to iterate through a groupby object). I'll have to change it so that I iterate through the whole groupby object in a single pass, but I'm wondering if there's a built-in way in pandas to do this somewhat cleanly.

For example, I've tried something like

    grouped.agg({'C_sum' : lambda x: x['C'].sum(),
                 'C_std': lambda x: x['C'].std(),
                 'D_sum' : lambda x: x['D'].sum(),
                 'D_sumifC3': lambda x: x['D'][x['C'] == 3].sum(), ...})

but as expected I get a KeyError (since the keys have to be column names if agg is called on a DataFrame groupby).

Is there any built-in way to do what I'd like to do, or a possibility that this functionality may be added, or will I just need to iterate through the groupby manually?

asked Jan 25 '13 20:01

beardc

2 Answers

The second half of the currently accepted answer is outdated and relies on two deprecated features. First and most importantly, you can no longer pass a dictionary of dictionaries to the groupby agg method. Second, never use .ix.

If you want to work with two separate columns at the same time, I would suggest using the apply method, which implicitly passes a DataFrame to the applied function. Let's use a dataframe similar to the one in the other answer:

    df = pd.DataFrame(np.random.rand(4,4), columns=list('abcd'))
    df['group'] = [0, 0, 1, 1]
    df

              a         b         c         d  group
    0  0.418500  0.030955  0.874869  0.145641      0
    1  0.446069  0.901153  0.095052  0.487040      0
    2  0.843026  0.936169  0.926090  0.041722      1
    3  0.635846  0.439175  0.828787  0.714123      1

A dictionary mapped from column names to aggregation functions is still a perfectly good way to perform an aggregation.

    df.groupby('group').agg({'a':['sum', 'max'],
                             'b':'mean',
                             'c':'sum',
                             'd': lambda x: x.max() - x.min()})

                  a                   b         c         d
                sum       max      mean       sum  <lambda>
    group
    0      0.864569  0.446069  0.466054  0.969921  0.341399
    1      1.478872  0.843026  0.687672  1.754877  0.672401

If you don't like that ugly lambda column name, you can use a normal function and supply a custom name to the special __name__ attribute like this:

    def max_min(x):
        return x.max() - x.min()

    max_min.__name__ = 'Max minus Min'

    df.groupby('group').agg({'a':['sum', 'max'],
                             'b':'mean',
                             'c':'sum',
                             'd': max_min})

                  a                   b         c             d
                sum       max      mean       sum Max minus Min
    group
    0      0.864569  0.446069  0.466054  0.969921      0.341399
    1      1.478872  0.843026  0.687672  1.754877      0.672401
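Another way to control the output names, assuming pandas 0.25 or later, is named aggregation: each keyword becomes an output column and its value is a (column, function) pair. The sketch below uses a small frame with made-up values in the same layout as the one above; the output names (`a_sum`, `d_range`, and so on) are only illustrative.

```python
import pandas as pd

# Deterministic stand-in for the random frame above (values are made up)
df = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [10.0, 20.0, 30.0, 40.0],
    'c': [0.5, 1.5, 2.5, 3.5],
    'd': [4.0, 1.0, 7.0, 2.0],
    'group': [0, 0, 1, 1],
})

def max_min(x):
    # x is the Series of one column within one group
    return x.max() - x.min()

# keyword = output column name, value = (input column, aggregation)
result = df.groupby('group').agg(
    a_sum=('a', 'sum'),
    a_max=('a', 'max'),
    b_mean=('b', 'mean'),
    c_sum=('c', 'sum'),
    d_range=('d', max_min),
)
print(result)
```

This gives flat column names instead of a MultiIndex, and the same input column can appear under several keywords.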

Using `apply` and returning a Series

Now, if you have multiple columns that need to interact with each other, you cannot use agg, which implicitly passes a Series to the aggregating function. When you use apply, the entire group is passed to the function as a DataFrame.

I recommend making a single custom function that returns a Series of all the aggregations. Use the Series index as labels for the new columns:

    def f(x):
        d = {}
        d['a_sum'] = x['a'].sum()
        d['a_max'] = x['a'].max()
        d['b_mean'] = x['b'].mean()
        d['c_d_prodsum'] = (x['c'] * x['d']).sum()
        return pd.Series(d, index=['a_sum', 'a_max', 'b_mean', 'c_d_prodsum'])

    df.groupby('group').apply(f)

              a_sum     a_max    b_mean  c_d_prodsum
    group
    0      0.864569  0.446069  0.466054     0.173711
    1      1.478872  0.843026  0.687672     0.630494

If you are in love with MultiIndexes, you can still return a Series with one like this:

    def f_mi(x):
        d = []
        d.append(x['a'].sum())
        d.append(x['a'].max())
        d.append(x['b'].mean())
        d.append((x['c'] * x['d']).sum())
        return pd.Series(d, index=[['a', 'a', 'b', 'c_d'],
                                   ['sum', 'max', 'mean', 'prodsum']])

    df.groupby('group').apply(f_mi)

                  a                   b       c_d
                sum       max      mean   prodsum
    group
    0      0.864569  0.446069  0.466054  0.173711
    1      1.478872  0.843026  0.687672  0.630494
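Applied to the question's own example, the same pattern might look like the sketch below. The output names (`C_sum`, `C_std`, `D_sum`, `D_sumifC3`) come from the question; the frame, its values, the group key `A`, and the function name `summarize` are assumptions made up for illustration. The value columns are selected explicitly before calling apply so that the function only sees `C` and `D`.

```python
import pandas as pd

# Hypothetical data; 'A' is assumed to be the grouping key
df = pd.DataFrame({
    'A': ['x', 'x', 'y', 'y'],
    'C': [3, 1, 3, 3],
    'D': [10.0, 20.0, 30.0, 40.0],
})

def summarize(g):
    # g is the sub-DataFrame for one group, so columns can interact freely
    return pd.Series({
        'C_sum': g['C'].sum(),
        'C_std': g['C'].std(),
        'D_sum': g['D'].sum(),
        # "sumif": sum D only over the rows where C equals 3
        'D_sumifC3': g.loc[g['C'] == 3, 'D'].sum(),
    })

result = df.groupby('A')[['C', 'D']].apply(summarize)
print(result)
```

Because the function runs once per group in Python, this can be slow when there are many groups.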

answered Sep 20 '22 03:09

Ted Petrou

For the first part you can pass a dict of column names for keys and a list of functions for the values:

    In [28]: df
    Out[28]:
              A         B         C         D         E  GRP
    0  0.395670  0.219560  0.600644  0.613445  0.242893    0
    1  0.323911  0.464584  0.107215  0.204072  0.927325    0
    2  0.321358  0.076037  0.166946  0.439661  0.914612    1
    3  0.133466  0.447946  0.014815  0.130781  0.268290    1

    In [26]: f = {'A':['sum','mean'], 'B':['prod']}

    In [27]: df.groupby('GRP').agg(f)
    Out[27]:
                A                   B
              sum      mean      prod
    GRP
    0    0.719580  0.359790  0.102004
    1    0.454824  0.227412  0.034060

UPDATE 1:

Because the aggregate function works on Series, references to the other column names are lost. To get around this, you can reference the full dataframe and index it using the group indices within the lambda function.

Here's a hacky workaround:

    In [67]: f = {'A':['sum','mean'], 'B':['prod'], 'D': lambda g: df.loc[g.index].E.sum()}

    In [69]: df.groupby('GRP').agg(f)
    Out[69]:
                A                   B         D
              sum      mean      prod  <lambda>
    GRP
    0    0.719580  0.359790  0.102004  1.170219
    1    0.454824  0.227412  0.034060  1.182901

Here, the resultant 'D' column is made up of the summed 'E' values.

UPDATE 2:

Here's a method that I think will do everything you ask. First make a custom lambda function. Below, g references the group. When aggregating, g will be a Series. Passing g.index to df.loc[] selects the current group's rows from df. I then test whether column C is less than 0.5. The resulting boolean Series is passed to g[], which selects only the rows meeting the criterion.

    In [95]: cust = lambda g: g[df.loc[g.index]['C'] < 0.5].sum()

    In [96]: f = {'A':['sum','mean'], 'B':['prod'], 'D': {'my name': cust}}

    In [97]: df.groupby('GRP').agg(f)
    Out[97]:
                A                   B         D
              sum      mean      prod   my name
    GRP
    0    0.719580  0.359790  0.102004  0.204072
    1    0.454824  0.227412  0.034060  0.570441
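The nested dict `{'my name': cust}` relies on the dict-of-dicts renaming that newer pandas versions no longer accept. A rough alternative, assuming pandas 0.25 or later, is to build the conditional column before grouping and then use named aggregation. The frame values and the names `D_where_C_small` and `my_name` below are made up for illustration; this is a sketch of a different technique, not the method used above.

```python
import pandas as pd

# Hypothetical data in the same layout as the frame above
df = pd.DataFrame({
    'A': [1.0, 2.0, 3.0, 4.0],
    'B': [2.0, 3.0, 4.0, 5.0],
    'C': [0.6, 0.1, 0.2, 0.9],
    'D': [10.0, 20.0, 30.0, 40.0],
    'GRP': [0, 0, 1, 1],
})

# Keep D only where C < 0.5; the other rows become NaN, which sum() skips
masked = df.assign(D_where_C_small=df['D'].where(df['C'] < 0.5))

result = masked.groupby('GRP').agg(
    A_sum=('A', 'sum'),
    A_mean=('A', 'mean'),
    B_prod=('B', 'prod'),
    my_name=('D_where_C_small', 'sum'),
)
print(result)
```

Since the mask is computed once on the whole frame, the lambda no longer needs to look up df from inside the aggregation.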

answered Sep 21 '22 03:09

Zelazny7
