
Apply multiple functions to multiple groupby columns

The docs show how to apply multiple functions on a groupby object at a time using a dict with the output column names as the keys:

    In [563]: grouped['D'].agg({'result1' : np.sum,
       .....:                   'result2' : np.mean})
       .....:
    Out[563]:
          result2   result1
    A
    bar -0.579846 -1.739537
    foo -0.280588 -1.402938

However, this only works on a Series groupby object. When a dict is similarly passed to a DataFrame groupby object, it expects the keys to be the names of the columns that the functions will be applied to.

What I want to do is apply multiple functions to several columns (some columns will be operated on multiple times). Also, some functions will depend on other columns in the groupby object (like sumif functions). My current solution is to go column by column, doing something like the code above and using lambdas for functions that depend on other rows. But this is taking a long time (I think iterating through a groupby object is slow). I'll have to change it so that I iterate through the whole groupby object in a single run, but I'm wondering if there's a built-in way in pandas to do this somewhat cleanly.

For example, I've tried something like

    grouped.agg({'C_sum': lambda x: x['C'].sum(),
                 'C_std': lambda x: x['C'].std(),
                 'D_sum': lambda x: x['D'].sum(),
                 'D_sumifC3': lambda x: x['D'][x['C'] == 3].sum(), ...})

but as expected I get a KeyError (since the keys have to be a column if agg is called from a DataFrame).

Is there any built-in way to do what I'd like to do, or a possibility that this functionality may be added, or will I just need to iterate through the groupby manually?

beardc asked Jan 25 '13 20:01



2 Answers

The second half of the currently accepted answer is outdated and has two deprecations. First and most important, you can no longer pass a dictionary of dictionaries to the agg groupby method. Second, never use .ix; it was deprecated and later removed from pandas.

If you want to work with two separate columns at the same time, I would suggest using the apply method, which implicitly passes a DataFrame to the applied function. Let's use a dataframe similar to the one above:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.rand(4,4), columns=list('abcd'))
    df['group'] = [0, 0, 1, 1]
    df

              a         b         c         d  group
    0  0.418500  0.030955  0.874869  0.145641      0
    1  0.446069  0.901153  0.095052  0.487040      0
    2  0.843026  0.936169  0.926090  0.041722      1
    3  0.635846  0.439175  0.828787  0.714123      1

A dictionary mapped from column names to aggregation functions is still a perfectly good way to perform an aggregation.

    df.groupby('group').agg({'a':['sum', 'max'],
                             'b':'mean',
                             'c':'sum',
                             'd': lambda x: x.max() - x.min()})

                  a                   b         c         d
                sum       max      mean       sum  <lambda>
    group
    0      0.864569  0.446069  0.466054  0.969921  0.341399
    1      1.478872  0.843026  0.687672  1.754877  0.672401
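A dict that contains a list of functions produces two-level column labels. If flat names are easier to work with, one option is to join the levels after aggregating. The sketch below is illustrative only: it uses a small hand-made frame (hypothetical data, not the random values above) and the names `result` and the `_` separator are assumptions.

```python
import pandas as pd

# Small deterministic frame (hypothetical data for illustration).
df = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0],
                   'b': [10.0, 20.0, 30.0, 40.0],
                   'group': [0, 0, 1, 1]})

# Because 'a' maps to a list, the result has MultiIndex columns.
result = df.groupby('group').agg({'a': ['sum', 'max'], 'b': 'mean'})

# Each column label is a tuple such as ('a', 'sum'); join the levels with '_'.
result.columns = ['_'.join(col) for col in result.columns]
```

After this, the columns are `a_sum`, `a_max` and `b_mean`, indexed by `group`.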

If you don't like that ugly lambda column name, you can use a normal function and supply a custom name to the special __name__ attribute like this:

    def max_min(x):
        return x.max() - x.min()

    max_min.__name__ = 'Max minus Min'

    df.groupby('group').agg({'a':['sum', 'max'],
                             'b':'mean',
                             'c':'sum',
                             'd': max_min})

                  a                   b         c             d
                sum       max      mean       sum Max minus Min
    group
    0      0.864569  0.446069  0.466054  0.969921      0.341399
    1      1.478872  0.843026  0.687672  1.754877      0.672401
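As an alternative to setting `__name__`, newer pandas versions (0.25 and later, as far as I know) support named aggregation, where each keyword argument names an output column and maps to a `(column, function)` pair. This is a sketch on hypothetical data; the output names `a_sum`, `a_max` and `d_range` are my own choices, not part of the answer above.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0],
                   'd': [5.0, 1.0, 2.0, 8.0],
                   'group': [0, 0, 1, 1]})

def max_min(x):
    # x is one column of one group, passed as a Series.
    return x.max() - x.min()

# Keyword = output column name, value = (input column, aggregation).
named = df.groupby('group').agg(
    a_sum=('a', 'sum'),
    a_max=('a', 'max'),
    d_range=('d', max_min),
)
```

The result has flat column names, so no renaming or flattening step is needed afterwards.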

Using apply and returning a Series

Now, if you have multiple columns that need to interact, you cannot use agg, which implicitly passes a Series to the aggregating function. When using apply, the entire group gets passed into the function as a DataFrame.

I recommend making a single custom function that returns a Series of all the aggregations. Use the Series index as labels for the new columns:

    def f(x):
        d = {}
        d['a_sum'] = x['a'].sum()
        d['a_max'] = x['a'].max()
        d['b_mean'] = x['b'].mean()
        d['c_d_prodsum'] = (x['c'] * x['d']).sum()
        return pd.Series(d, index=['a_sum', 'a_max', 'b_mean', 'c_d_prodsum'])

    df.groupby('group').apply(f)

              a_sum     a_max    b_mean  c_d_prodsum
    group
    0      0.864569  0.446069  0.466054     0.173711
    1      1.478872  0.843026  0.687672     0.630494
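The same pattern covers the "sumif" case from the question, where one column is summed only over rows that satisfy a condition on another column. The following is a minimal sketch on hypothetical data; the function name `summarize` and the column values are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({'group': [0, 0, 0, 1, 1],
                   'C': [3, 1, 3, 3, 2],
                   'D': [10.0, 20.0, 30.0, 40.0, 50.0]})

def summarize(g):
    # g is the sub-DataFrame for one group, so columns can interact freely.
    return pd.Series({
        'C_sum': g['C'].sum(),
        'D_sum': g['D'].sum(),
        # Conditional sum: D restricted to the rows where C equals 3.
        'D_sumifC3': g.loc[g['C'] == 3, 'D'].sum(),
    })

# Selecting the value columns explicitly keeps the grouping column
# out of the frames handed to summarize.
out = df.groupby('group')[['C', 'D']].apply(summarize)
```

Each key of the returned Series becomes a column of `out`, with one row per group.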

If you are in love with MultiIndexes, you can still return a Series with one like this:

    def f_mi(x):
        d = []
        d.append(x['a'].sum())
        d.append(x['a'].max())
        d.append(x['b'].mean())
        d.append((x['c'] * x['d']).sum())
        return pd.Series(d, index=[['a', 'a', 'b', 'c_d'],
                                   ['sum', 'max', 'mean', 'prodsum']])

    df.groupby('group').apply(f_mi)

                  a                   b       c_d
                sum       max      mean   prodsum
    group
    0      0.864569  0.446069  0.466054  0.173711
    1      1.478872  0.843026  0.687672  0.630494
Ted Petrou answered Sep 20 '22 03:09


For the first part you can pass a dict of column names for keys and a list of functions for the values:

    In [28]: df
    Out[28]:
              A         B         C         D         E  GRP
    0  0.395670  0.219560  0.600644  0.613445  0.242893    0
    1  0.323911  0.464584  0.107215  0.204072  0.927325    0
    2  0.321358  0.076037  0.166946  0.439661  0.914612    1
    3  0.133466  0.447946  0.014815  0.130781  0.268290    1

    In [26]: f = {'A':['sum','mean'], 'B':['prod']}

    In [27]: df.groupby('GRP').agg(f)
    Out[27]:
                A                   B
              sum      mean      prod
    GRP
    0    0.719580  0.359790  0.102004
    1    0.454824  0.227412  0.034060

UPDATE 1:

Because the aggregate function works on Series, references to the other column names are lost. To get around this, you can reference the full dataframe and index it using the group indices within the lambda function.

Here's a hacky workaround:

    In [67]: f = {'A':['sum','mean'], 'B':['prod'], 'D': lambda g: df.loc[g.index].E.sum()}

    In [69]: df.groupby('GRP').agg(f)
    Out[69]:
                A                   B         D
              sum      mean      prod  <lambda>
    GRP
    0    0.719580  0.359790  0.102004  1.170219
    1    0.454824  0.227412  0.034060  1.182901

Here, the resultant 'D' column is made up of the summed 'E' values.

UPDATE 2:

Here's a method that I think will do everything you ask. First define a custom lambda function. Below, g references the group. When aggregating, g will be a Series. Passing g.index to df.loc[] selects the current group's rows from df. I then test whether column C is less than 0.5. The resulting boolean Series is passed to g[], which selects only the rows meeting the criterion.

    In [95]: cust = lambda g: g[df.loc[g.index]['C'] < 0.5].sum()

    In [96]: f = {'A':['sum','mean'], 'B':['prod'], 'D': {'my name': cust}}

    In [97]: df.groupby('GRP').agg(f)
    Out[97]:
                A                   B         D
              sum      mean      prod   my name
    GRP
    0    0.719580  0.359790  0.102004  0.204072
    1    0.454824  0.227412  0.034060  0.570441
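A different way to get the same kind of conditional sum, without reaching back into df from inside a lambda, is to mask the column before grouping and then use ordinary built-in aggregations. This is a sketch under my own assumptions: the data is hypothetical, it relies on named aggregation (available in pandas 0.25 and later, as far as I know), and the names `masked`, `res`, `D_if_C_small` and `D_sum_if_C_small` are invented for illustration.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({'GRP': [0, 0, 1, 1],
                   'C': [0.6, 0.1, 0.2, 0.0],
                   'D': [1.0, 2.0, 3.0, 4.0]})

# Keep D only where C < 0.5; other rows become NaN, which sum() skips.
masked = df.assign(D_if_C_small=df['D'].where(df['C'] < 0.5))

res = masked.groupby('GRP').agg(
    D_sum=('D', 'sum'),
    D_sum_if_C_small=('D_if_C_small', 'sum'),
)
```

Since the condition is evaluated once on the whole frame and the aggregation uses the string name `'sum'`, this avoids calling a Python function per group, which may help with the slowness mentioned in the question.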
Zelazny7 answered Sep 21 '22 03:09