The docs show how to apply multiple functions on a groupby object at a time using a dict with the output column names as the keys:
    In [563]: grouped['D'].agg({'result1' : np.sum,
       .....:                   'result2' : np.mean})
       .....:
    Out[563]:
          result2   result1
    A
    bar -0.579846 -1.739537
    foo -0.280588 -1.402938
However, this only works on a Series groupby object. When a dict is similarly passed to a DataFrame groupby object, it expects the keys to be the names of the columns that the functions will be applied to.
What I want to do is apply multiple functions to several columns (with certain columns operated on multiple times). Also, some functions will depend on other columns in the groupby object (like sumif functions). My current solution is to go column by column and do something like the code above, using lambdas for functions that depend on other rows. But this takes a long time (I think iterating through a groupby object is slow). I'll have to change it so that I iterate through the whole groupby object in a single run, but I'm wondering if there's a built-in way in pandas to do this somewhat cleanly.
For example, I've tried something like
    grouped.agg({'C_sum' : lambda x: x['C'].sum(),
                 'C_std': lambda x: x['C'].std(),
                 'D_sum' : lambda x: x['D'].sum(),
                 'D_sumifC3': lambda x: x['D'][x['C'] == 3].sum(), ...})

but as expected I get a KeyError (since the keys have to be column names if agg is called from a DataFrame).
Is there any built-in way to do what I'd like to do, or a possibility that this functionality may be added, or will I just need to iterate through the groupby manually?
To apply aggregations to multiple columns, just add additional key:value pairs to the dictionary. Applying multiple aggregation functions to a single column will result in MultiIndex columns. Working with multi-indexed columns is a pain, and I'd recommend flattening them after aggregating by renaming the new columns.
agg is an alias for aggregate. Use the alias. A user-defined function passed to agg receives a Series for evaluation.
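As a rough illustration of the flattening step, here is a minimal sketch. The toy data and the "_" separator are assumptions made up for this example, not something from the thread.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
df = pd.DataFrame({
    "group": [0, 0, 1, 1],
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [10.0, 20.0, 30.0, 40.0],
})

# Two functions on "a" and one on "b" produce two-level column labels
# such as ("a", "sum") and ("b", "mean").
result = df.groupby("group").agg({"a": ["sum", "max"], "b": "mean"})

# Join the two levels of each label into a single string like "a_sum".
result.columns = ["_".join(col) for col in result.columns]
print(result)
```

Any naming scheme works here; joining the levels with an underscore is just one convenient choice.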
We can group the resultset in SQL on multiple column values. When we define the grouping criteria on more than one column, all the records having the same value for the columns defined in the group by clause are collectively represented using a single record in the query output.
The second half of the currently accepted answer is outdated and has two deprecations. First and most important, you can no longer pass a dictionary of dictionaries to the agg groupby method. Second, never use .ix.
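One possible replacement for the dictionary-of-dictionaries renaming is keyword-based "named aggregation", sketched below. This assumes pandas 0.25 or later; the toy data, the output names a_sum, a_max and d_range, and the helper max_minus_min are all made up for illustration.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
df = pd.DataFrame({
    "group": [0, 0, 1, 1],
    "a": [1.0, 2.0, 3.0, 4.0],
    "d": [5.0, 7.0, 1.0, 4.0],
})

def max_minus_min(x):
    # x is the Series of one column's values within one group.
    return x.max() - x.min()

# Each keyword becomes an output column name; each value is a
# (source column, aggregation function) pair.
named = df.groupby("group").agg(
    a_sum=("a", "sum"),
    a_max=("a", "max"),
    d_range=("d", max_minus_min),
)
print(named)
```

The result has flat column names, so no renaming step is needed afterwards.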
If you want to work with two separate columns at the same time, I would suggest using the apply method, which implicitly passes a DataFrame to the applied function. Let's use a dataframe similar to the one above:

    df = pd.DataFrame(np.random.rand(4,4), columns=list('abcd'))
    df['group'] = [0, 0, 1, 1]
    df

              a         b         c         d  group
    0  0.418500  0.030955  0.874869  0.145641      0
    1  0.446069  0.901153  0.095052  0.487040      0
    2  0.843026  0.936169  0.926090  0.041722      1
    3  0.635846  0.439175  0.828787  0.714123      1
A dictionary mapped from column names to aggregation functions is still a perfectly good way to perform an aggregation.
    df.groupby('group').agg({'a':['sum', 'max'],
                             'b':'mean',
                             'c':'sum',
                             'd': lambda x: x.max() - x.min()})

                  a                   b         c         d
                sum       max      mean       sum  <lambda>
    group
    0      0.864569  0.446069  0.466054  0.969921  0.341399
    1      1.478872  0.843026  0.687672  1.754877  0.672401

If you don't like that ugly lambda column name, you can use a normal function and supply a custom name to the special __name__ attribute like this:

    def max_min(x):
        return x.max() - x.min()

    max_min.__name__ = 'Max minus Min'

    df.groupby('group').agg({'a':['sum', 'max'],
                             'b':'mean',
                             'c':'sum',
                             'd': max_min})

                  a                   b         c             d
                sum       max      mean       sum Max minus Min
    group
    0      0.864569  0.446069  0.466054  0.969921      0.341399
    1      1.478872  0.843026  0.687672  1.754877      0.672401
Using apply and returning a Series

Now, if you had multiple columns that needed to interact together, you cannot use agg, which implicitly passes a Series to the aggregating function. When using apply, the entire group is passed into the function as a DataFrame.
I recommend making a single custom function that returns a Series of all the aggregations. Use the Series index as labels for the new columns:
    def f(x):
        d = {}
        d['a_sum'] = x['a'].sum()
        d['a_max'] = x['a'].max()
        d['b_mean'] = x['b'].mean()
        d['c_d_prodsum'] = (x['c'] * x['d']).sum()
        return pd.Series(d, index=['a_sum', 'a_max', 'b_mean', 'c_d_prodsum'])

    df.groupby('group').apply(f)

              a_sum     a_max    b_mean  c_d_prodsum
    group
    0      0.864569  0.446069  0.466054     0.173711
    1      1.478872  0.843026  0.687672     0.630494
If you are in love with MultiIndexes, you can still return a Series with one like this:
    def f_mi(x):
        d = []
        d.append(x['a'].sum())
        d.append(x['a'].max())
        d.append(x['b'].mean())
        d.append((x['c'] * x['d']).sum())
        return pd.Series(d, index=[['a', 'a', 'b', 'c_d'],
                                   ['sum', 'max', 'mean', 'prodsum']])

    df.groupby('group').apply(f_mi)

                  a                   b       c_d
                sum       max      mean   prodsum
    group
    0      0.864569  0.446069  0.466054  0.173711
    1      1.478872  0.843026  0.687672  0.630494
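The same apply pattern can plausibly cover the conditional sum from the question (the 'D_sumifC3' entry). The following is only a sketch under assumptions: the toy data and the function name summarize are invented for illustration, and the column names C and D are borrowed from the question.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
df = pd.DataFrame({
    "group": [0, 0, 0, 1, 1],
    "C": [3, 1, 3, 3, 2],
    "D": [10.0, 20.0, 30.0, 40.0, 50.0],
})

def summarize(g):
    # g is the sub-DataFrame for one group, so columns can interact.
    return pd.Series({
        "C_sum": g["C"].sum(),
        "C_std": g["C"].std(),
        "D_sum": g["D"].sum(),
        # "sumif": add up D only on the rows where C equals 3.
        "D_sumifC3": g.loc[g["C"] == 3, "D"].sum(),
    })

# Selecting the value columns first keeps the grouping column out of g.
result = df.groupby("group")[["C", "D"]].apply(summarize)
print(result)
```

Because the function sees the whole group at once, each group is visited a single time rather than once per output column.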
For the first part you can pass a dict of column names for keys and a list of functions for the values:
    In [28]: df
    Out[28]:
              A         B         C         D         E  GRP
    0  0.395670  0.219560  0.600644  0.613445  0.242893    0
    1  0.323911  0.464584  0.107215  0.204072  0.927325    0
    2  0.321358  0.076037  0.166946  0.439661  0.914612    1
    3  0.133466  0.447946  0.014815  0.130781  0.268290    1

    In [26]: f = {'A':['sum','mean'], 'B':['prod']}

    In [27]: df.groupby('GRP').agg(f)
    Out[27]:
                A                   B
              sum      mean      prod
    GRP
    0    0.719580  0.359790  0.102004
    1    0.454824  0.227412  0.034060
UPDATE 1:
Because the aggregate function works on Series, references to the other column names are lost. To get around this, you can reference the full dataframe and index it using the group indices within the lambda function.
Here's a hacky workaround:
    In [67]: f = {'A':['sum','mean'], 'B':['prod'], 'D': lambda g: df.loc[g.index].E.sum()}

    In [69]: df.groupby('GRP').agg(f)
    Out[69]:
                A                   B         D
              sum      mean      prod  <lambda>
    GRP
    0    0.719580  0.359790  0.102004  1.170219
    1    0.454824  0.227412  0.034060  1.182901
Here, the resultant 'D' column is made up of the summed 'E' values.
UPDATE 2:
Here's a method that I think will do everything you ask. First make a custom lambda function. Below, g references the group. When aggregating, g will be a Series. Passing g.index to df.loc[] selects the current group from df. I then test if column C is less than 0.5. The returned boolean Series is passed to g[], which selects only those rows meeting the criteria.

    In [95]: cust = lambda g: g[df.loc[g.index]['C'] < 0.5].sum()

    In [96]: f = {'A':['sum','mean'], 'B':['prod'], 'D': {'my name': cust}}

    In [97]: df.groupby('GRP').agg(f)
    Out[97]:
                A                   B         D
              sum      mean      prod   my name
    GRP
    0    0.719580  0.359790  0.102004  0.204072
    1    0.454824  0.227412  0.034060  0.570441
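Since another answer in this thread notes that a dictionary of dictionaries is no longer accepted by agg, here is a sketch of how the same lookup trick might be written with keyword-based named aggregation (assuming pandas 0.25 or later). The data is the C, D and GRP columns from the frame shown above; the output name my_name is an assumption, chosen because a keyword cannot contain a space.

```python
import pandas as pd

# Columns C, D and GRP copied from the example frame in this answer.
df = pd.DataFrame({
    "C": [0.600644, 0.107215, 0.166946, 0.014815],
    "D": [0.613445, 0.204072, 0.439661, 0.130781],
    "GRP": [0, 0, 1, 1],
})

# g is the Series of D values for one group. Its index is used to look up
# the matching rows of column C in the full frame, and the resulting
# boolean mask keeps only the D values where C < 0.5.
cust = lambda g: g[df.loc[g.index, "C"] < 0.5].sum()

result = df.groupby("GRP").agg(my_name=("D", cust))
print(result)
```

This still relies on the function closing over the outer df, so it shares the hacky nature of the original workaround.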