Pandas groupby aggregate passing group name to aggregate

I often need to aggregate a DataFrame with a custom aggregate function. In this particular case, the function needs to know which group it is currently processing in order to perform the aggregation correctly.

A function passed to DataFrameGroupBy.aggregate() is called for each group and for each column, and receives a Series with the elements of the current group and column. The only way I have found to get the group name from inside the aggregate function is to add the grouping column to the index and then extract the value with x.index.get_level_values('power')[0]. Here is an example:

def _tail_mean_user_th(x):
    power = x.index.get_level_values('power')[0]
    th = th_dict[power]  # this value changes with the group
    return x.loc[x > th].mean() - th

mbsize_df = (bursts_sel.set_index('power', append=True).groupby('power')
             .agg({'nt': _tail_mean_user_th}))

It seems to me that it is a pretty common occurrence that the aggregate function needs to know the current group. Is there a more straightforward pattern in this situation?


EDIT: The solution that I accepted below consists of using apply instead of agg on the GroupBy object. The difference between the two is that agg calls the function for each group and each column separately, while apply calls the function once per group (all columns at once). A subtle consequence is that agg passes a Series for the current group and column whose name attribute equals the original column name. Conversely, apply passes a Series whose name attribute equals the current group, which is what my question asked for. Interestingly, when operating on multiple columns, apply passes a DataFrame with a name attribute (normally non-existent for DataFrames) set to the group name. So this pattern also works when aggregating multiple columns at once.
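A minimal sketch of the difference described in this edit, using made-up data. The DataFrame, its columns (power, nt, other) and the helper record_name are illustrative assumptions only, not part of my original code:

```python
import pandas as pd

# Made-up data; "power" is the grouping column
df = pd.DataFrame({
    "power": [10, 10, 20, 20],
    "nt": [1, 2, 3, 4],
    "other": [5, 6, 7, 8],
})

def record_name(seen):
    """Return a function that stores the .name of whatever it receives."""
    def func(x):
        seen.append(x.name)
        return x.sum()
    return func

agg_names, apply_names, frame_names = [], [], []

# agg: the function gets one Series per group and column
df.groupby("power").agg({"nt": record_name(agg_names)})

# apply on a single column: the function gets one Series per group
df.groupby("power")["nt"].apply(record_name(apply_names))

# apply on several columns: the function gets one DataFrame per group
df.groupby("power")[["nt", "other"]].apply(record_name(frame_names))

print(set(agg_names))    # the column name
print(set(apply_names))  # the group keys
print(set(frame_names))  # the group keys
```

The names are collected into sets because some pandas versions may call the function more than once for the first group.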

For more info, see "What is the difference between pandas agg and apply function?"

user2304916 asked Jan 03 '23 10:01


1 Answer

If you use groupby + apply, the group name is available through the .name attribute:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 1, 2], 'b': [1, 1, 2, 2]})

def foo(g):
    # Under apply, g.name holds the key of the current group
    print('at group %s' % g.name)
    return int(g.name) + g.sum()

>>> df.b.groupby(df.a).apply(foo)
at group 1
at group 2
a
1    4
2    5
Name: b, dtype: int64
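Applied to the threshold example from the question, the same idea might look like the sketch below. The contents of bursts_sel and th_dict are not shown in the question, so the values here are hypothetical stand-ins chosen only to make the snippet runnable:

```python
import pandas as pd

# Hypothetical stand-ins for the question's bursts_sel and th_dict
bursts_sel = pd.DataFrame({
    "power": [10, 10, 10, 20, 20, 20],
    "nt": [5, 15, 25, 10, 30, 50],
})
th_dict = {10: 10, 20: 20}  # one threshold per group key

def tail_mean_user_th(x):
    th = th_dict[x.name]          # under apply, x.name is the group key
    return x[x > th].mean() - th  # mean of the values above th, minus th

# Select the single column, then apply once per group
mbsize = bursts_sel.groupby("power")["nt"].apply(tail_mean_user_th)
print(mbsize)
```

With this approach there is no need to append the grouping column to the index first.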
Ami Tavory answered Jan 05 '23 17:01