For example, I have two lambda functions to apply to a grouped data frame:
df.groupby(['A', 'B']).apply(lambda g: ...)
df.groupby(['A', 'B']).apply(lambda g: ...)
Both would work, but not when combined:
df.groupby(['A', 'B']).apply([lambda g: ..., lambda g: ...])
Why is that? How can I apply different functions to a grouped object and get the results concatenated column-wise?
Is there a way to avoid specifying a column for each function? Everything suggested so far seems to work only on specific columns.
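One possible approach, sketched here under assumptions: `apply` expects a single callable, so passing a list fails, but each function can be applied separately and the results joined with `pd.concat(..., axis=1)`. The data frame, the extra column `D`, and the helpers `c_range` and `weighted_mean` are hypothetical stand-ins for the elided lambdas, and each is assumed to return one scalar per group.

```python
import pandas as pd

# Hypothetical data; column D is made up for illustration.
df = pd.DataFrame({
    "A": [1, 1, 2, 2],
    "B": [1, 1, 1, 1],
    "C": [0, 1, 2, 3],
    "D": [10.0, 20.0, 30.0, 40.0],
})

grouped = df.groupby(["A", "B"])

# Hypothetical whole-group functions: each receives the group's
# sub-frame and returns a single scalar.
def c_range(g):
    return g["C"].max() - g["C"].min()

def weighted_mean(g):
    return (g["C"] * g["D"]).sum() / g["D"].sum()

# Apply each function on its own, then join the resulting Series
# side by side; the dict keys become the column names.
result = pd.concat(
    {
        "c_range": grouped.apply(c_range),
        "weighted_mean": grouped.apply(weighted_mean),
    },
    axis=1,
)
print(result)
```

Because each function sees the whole group, it can use any combination of columns rather than a single one.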
To apply aggregations to multiple columns, just add more key: value pairs to the dictionary. Applying multiple aggregation functions to a single column produces MultiIndex columns. Working with MultiIndex columns is a pain, so I'd recommend flattening them after aggregating by renaming the new columns.
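A minimal sketch of that dictionary approach, assuming a small made-up frame with columns `A`, `C`, and `D` (these names and values are illustrative only). It aggregates two columns and then flattens the resulting two-level column labels by joining each pair with an underscore.

```python
import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame({
    "A": [1, 1, 2, 2],
    "C": [0, 1, 2, 3],
    "D": [10.0, 20.0, 30.0, 40.0],
})

# One key per column; a list of functions yields two-level column labels
# such as ("C", "min") and ("C", "max").
agg = df.groupby("A").agg({"C": ["min", "max"], "D": ["sum"]})

# Flatten the two levels into single names like "C_min".
agg.columns = ["_".join(col) for col in agg.columns]
print(agg)
```

Joining with an underscore is only one naming convention; any mapping from the label pairs to unique strings would do.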
This is a good opportunity to highlight one of the changes in pandas 0.20:
Deprecate groupby.agg() with a dictionary when renaming
What does this mean?
Consider the dataframe df
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(
    A=np.tile([1, 2], 2).repeat(2),
    B=np.repeat([1, 2], 2).repeat(2),
    C=np.arange(8)
))
df
   A  B  C
0  1  1  0
1  1  1  1
2  2  1  2
3  2  1  3
4  1  2  4
5  1  2  5
6  2  2  6
7  2  2  7
We could previously do
df.groupby(['A', 'B']).C.agg(dict(f1=lambda x: x.size, f2=lambda x: x.max()))
     f1  f2
A B
1 1   2   1
  2   2   5
2 1   2   3
  2   2   7
And our names 'f1' and 'f2' were placed as column headers. However, with pandas 0.20 I get this:
//anaconda/envs/3.6/lib/python3.6/site-packages/ipykernel/__main__.py:1: FutureWarning: using a dict on a Series for aggregation is deprecated and will be removed in a future version
  if __name__ == '__main__':
So what does that mean? What if I pass two lambdas without the naming dictionary?
df.groupby(['A', 'B']).C.agg([lambda x: x.size, lambda x: x.max()])
---------------------------------------------------------------------------
SpecificationError Traceback (most recent call last)
<ipython-input-398-fc26cf466812> in <module>()
----> 1 print(df.groupby(['A', 'B']).C.agg([lambda x: x.size, lambda x: x.max()]))
//anaconda/envs/3.6/lib/python3.6/site-packages/pandas/core/groupby.py in aggregate(self, func_or_funcs, *args, **kwargs)
2798 if hasattr(func_or_funcs, '__iter__'):
2799 ret = self._aggregate_multiple_funcs(func_or_funcs,
-> 2800 (_level or 0) + 1)
2801 else:
2802 cyfunc = self._is_cython_func(func_or_funcs)
//anaconda/envs/3.6/lib/python3.6/site-packages/pandas/core/groupby.py in _aggregate_multiple_funcs(self, arg, _level)
2863 if name in results:
2864 raise SpecificationError('Function names must be unique, '
-> 2865 'found multiple named %s' % name)
2866
2867 # reset the cache so that we
SpecificationError: Function names must be unique, found multiple named <lambda>
pandas raises an error because both functions are named '<lambda>', so the two result columns would end up with the same name.
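The name clash can be seen in plain Python, without pandas. This is a small sketch; the variable names `size_fn` and `max_fn` are made up for illustration. Judging by the error message above, pandas labels each result column with the function's `__name__`, so assigning distinct names to the lambdas is one possible workaround. Newer pandas releases may handle duplicate lambdas differently.

```python
# Every lambda gets the same generic name.
size_fn = lambda x: x.size
max_fn = lambda x: x.max()
names_before = (size_fn.__name__, max_fn.__name__)
print(names_before)

# Function objects allow __name__ to be reassigned, which gives
# each lambda a distinct label.
size_fn.__name__ = "f1"
max_fn.__name__ = "f2"
names_after = (size_fn.__name__, max_fn.__name__)
print(names_after)
```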
Solution: Name your functions
def f1(x):
    return x.size

def f2(x):
    return x.max()

df.groupby(['A', 'B']).C.agg([f1, f2])
     f1  f2
A B
1 1   2   1
  2   2   5
2 1   2   3
  2   2   7
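As a follow-up sketch, and assuming pandas 0.25 or later is available (newer than the 0.20 release discussed here), named aggregation passes the output column names as keyword arguments, which avoids both the deprecated dictionary and the separate function definitions. The string aliases "size" and "max" stand in for the two functions above.

```python
import numpy as np
import pandas as pd

# Same frame as in the answer above.
df = pd.DataFrame(dict(
    A=np.tile([1, 2], 2).repeat(2),
    B=np.repeat([1, 2], 2).repeat(2),
    C=np.arange(8)
))

# Keyword names become the column headers (assumes pandas >= 0.25).
result = df.groupby(['A', 'B']).C.agg(f1='size', f2='max')
print(result)
```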