In a common usage pattern, I need to aggregate a DataFrame using a custom aggregate function. In this special case, the aggregate function needs to know the current group in order to correctly perform the aggregation.
A function passed to DataFrameGroupBy.aggregate() is called for each group and for each column, receiving the Series with the elements in the current group and column.
The only way I found to get the group name from inside the aggregate function is adding the grouping column to the index and then extracting the value with x.index.get_level_values('power')[0]. Here is an example:
def _tail_mean_user_th(x):
    # Recover the group key from the 'power' index level.
    power = x.index.get_level_values('power')[0]
    th = th_dict[power]  # this value changes with the group
    return x.loc[x > th].mean() - th

mbsize_df = (bursts_sel.set_index('power', append=True).groupby('power')
             .agg({'nt': _tail_mean_user_th}))
It seems to me that it is a pretty common occurrence that the aggregate function needs to know the current group. Is there a more straightforward pattern in this situation?
EDIT: The solution that I accepted below consists in using apply instead of agg on the GroupBy object. The difference between the two is that agg calls the function for each group and each column separately, while apply calls the function for each group (all columns at once). A subtle consequence of this is that agg will pass a Series for the current group and column with its name attribute equal to the original column name. Conversely, apply will pass a Series with a name attribute equal to the current group (which was my question). Interestingly, when operating on multiple columns, apply will pass a DataFrame with a name attribute (normally non-existent for DataFrames) set to the group name. So this pattern also works when aggregating multiple columns at once.
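A quick way to check this difference is to record the name attribute that each call receives. The sketch below uses made-up data. The column names and the helper functions record_agg and record_apply are only for illustration and are not part of my real code. The columns are selected explicitly before apply, since how apply treats the grouping column has changed between pandas versions.

```python
import pandas as pd

# Made-up data; column names are for illustration only.
df = pd.DataFrame({
    'power': [10, 10, 20, 20],
    'nt': [1, 2, 3, 4],
    'width': [0.1, 0.2, 0.3, 0.4],
})

agg_names = []
apply_names = []

def record_agg(x):
    # Under agg, x is a Series for one column of one group.
    agg_names.append(x.name)
    return x.sum()

def record_apply(g):
    # Under apply, g is a DataFrame holding the selected columns of one group.
    apply_names.append(g.name)
    return g['nt'].sum()

grouped = df.groupby('power')
by_agg = grouped.agg({'nt': record_agg})
by_apply = grouped[['nt', 'width']].apply(record_apply)

print(agg_names)    # names seen by the agg function
print(apply_names)  # names seen by the apply function
```

In the pandas versions I am aware of, the first list contains only the column name and the second contains the group keys.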
For more info see What is the difference between pandas agg and apply function?
If you use groupby + apply, then it is available through the .name attribute:
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 1, 2], 'b': [1, 1, 2, 2]})

def foo(g):
    print('at group %s' % g.name)
    return int(g.name) + g.sum()

>>> df.b.groupby(df.a).apply(foo)
at group 1
at group 2
a
1    4
2    5
Name: b, dtype: int64
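Applied to the example in the question, the same idea might look like the sketch below. The contents of bursts_sel and th_dict here are hypothetical stand-ins, since the real ones are not shown in the question. Only the column names 'power' and 'nt' are taken from it.

```python
import pandas as pd

# Hypothetical stand-ins for the question's bursts_sel and th_dict.
bursts_sel = pd.DataFrame({
    'power': [10, 10, 10, 20, 20, 20],
    'nt': [5, 15, 25, 10, 40, 60],
})
th_dict = {10: 10, 20: 30}

def tail_mean_user_th(x):
    # Under apply, x.name is the key of the current group.
    th = th_dict[x.name]
    return x.loc[x > th].mean() - th

# Select the column first, then apply per group; no set_index is needed.
mbsize = bursts_sel.groupby('power')['nt'].apply(tail_mean_user_th)
print(mbsize)
```

With this form the grouping column no longer has to be appended to the index just so the function can read it back.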