In Python, to obtain summaries by group, I use groupby().agg(fx()), e.g. groupby('variable').agg('sum'). What is the difference between that and directly using the function, e.g. groupby('variable').sum()?
Setup
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
The primary benefit of using agg is stated in the docs:
"Aggregate using one or more operations over the specified axis."
If you have separate operations that need to be applied to each individual column, agg takes a dictionary (or a function, string, or list of strings/functions) that allows you to create that mapping in a single statement. So if you'd like the sum of column a, and the mean of column b:
df.agg({'a': 'sum', 'b': 'mean'})
a 6.0
b 5.0
dtype: float64
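Since the question is about groupby, here is a minimal sketch of the same dictionary mapping applied per group. The 'variable' column and its values are made up for illustration; only the column names a and b come from the setup above.

```python
import pandas as pd

# Hypothetical data: 'variable' is an assumed grouping key, not part of the original setup.
df = pd.DataFrame({
    'variable': ['x', 'x', 'y'],
    'a': [1, 2, 3],
    'b': [4, 5, 6],
})

# One statement: sum of 'a' and mean of 'b' within each group.
per_group = df.groupby('variable').agg({'a': 'sum', 'b': 'mean'})
print(per_group)
```

With this data, group x gets a sum of 3 for a and a mean of 4.5 for b. Doing the same with direct methods would take two separate calls that you then have to combine yourself.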
It also allows you to apply multiple operations to a single column in a single statement. For example, to find the sum, mean, and std of column a:
df.agg({'a': ['sum', 'mean', 'std']})
        a
sum   6.0
mean  2.0
std   1.0
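The same idea carries over to grouped data. The sketch below assumes a made-up 'variable' grouping column and shows two ways to request several operations at once: a list of operation names, and named aggregation (available in pandas 0.25 and later), which lets you pick the output column names.

```python
import pandas as pd

# Hypothetical data: 'variable' is an assumed grouping key for illustration.
df = pd.DataFrame({
    'variable': ['x', 'x', 'y'],
    'a': [1, 2, 3],
})

# List form: the result columns are named after the operations.
stats = df.groupby('variable')['a'].agg(['sum', 'mean'])

# Named aggregation: output_name=(source_column, operation).
named = df.groupby('variable').agg(a_sum=('a', 'sum'), a_mean=('a', 'mean'))

print(stats)
print(named)
```

Both results hold the same numbers; the named form is mostly a matter of getting flat, descriptive column names without a rename step afterwards.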
There's no difference in outcome when you use agg with a single operation. I'd argue that df.agg('sum') is less clear than df.sum(), but the results will be the same:
df.agg('sum')
a 6
b 15
dtype: int64
df.sum()
a 6
b 15
dtype: int64
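To tie this back to the groupby case in the question, here is a quick check that the two spellings agree on grouped data. The 'variable' column is an assumption added for the example.

```python
import pandas as pd

# Hypothetical data: 'variable' is an assumed grouping key for illustration.
df = pd.DataFrame({
    'variable': ['x', 'x', 'y'],
    'a': [1, 2, 3],
    'b': [4, 5, 6],
})

via_agg = df.groupby('variable').agg('sum')  # single operation passed by name
direct = df.groupby('variable').sum()        # the same operation called directly

# DataFrame.equals compares values, dtypes, and labels.
same = via_agg.equals(direct)
print(same)
```

For numeric columns like these, same comes out True, so the choice between the two is about readability rather than results.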
The main benefit agg provides is the convenience of applying multiple operations.