Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Multiple aggregations of the same column using pandas GroupBy.agg()

People also ask

Do multiple aggregations Groupby pandas?

To apply more than one aggregation when using pandas GroupBy, you simply pass in a dictionary to the . agg function. In your dictionary, your key will be the column name and the value will be a list of operations you want to perform on the column. The result will be a DataFrame with a MultiIndex column.

Can you use Groupby with multiple columns in pandas?

Pandas comes with a whole host of sql-like aggregation functions you can apply when grouping on one or more columns. This is Python's closest equivalent to dplyr's group_by + summarise logic.

What does the AGG () method do?

The agg() method allows you to apply a function or a list of function names to be executed along one of the axis of the DataFrame, default 0, which is the index (row) axis. Note: the agg() method is an alias of the aggregate() method.

What is AGG in Groupby?

agg is an alias for aggregate . Use the alias. A passed user-defined-function will be passed a Series for evaluation.


You can simply pass the functions as a list:

In [20]: df.groupby("dummy").agg({"returns": [np.mean, np.sum]})
Out[20]:         
           mean       sum
dummy                    
1      0.036901  0.369012

or as a dictionary:

In [21]: df.groupby('dummy').agg({'returns':
                                  {'Mean': np.mean, 'Sum': np.sum}})
Out[21]: 
        returns          
           Mean       Sum
dummy                    
1      0.036901  0.369012

TLDR; Pandas groupby.agg has a new, easier syntax for specifying (1) aggregations on multiple columns, and (2) multiple aggregations on a column. So, to do this for pandas >= 0.25, use

df.groupby('dummy').agg(Mean=('returns', 'mean'), Sum=('returns', 'sum'))

           Mean       Sum
dummy                    
1      0.036901  0.369012

OR

df.groupby('dummy')['returns'].agg(Mean='mean', Sum='sum')

           Mean       Sum
dummy                    
1      0.036901  0.369012

Pandas >= 0.25: Named Aggregation

Pandas has changed the behavior of GroupBy.agg in favour of a more intuitive syntax for specifying named aggregations. See the 0.25 docs section on Enhancements as well as relevant GitHub issues GH18366 and GH26512.

From the documentation,

To support column-specific aggregation with control over the output column names, pandas accepts the special syntax in GroupBy.agg(), known as “named aggregation”, where

  • The keywords are the output column names
  • The values are tuples whose first element is the column to select and the second element is the aggregation to apply to that column. Pandas provides the pandas.NamedAgg namedtuple with the fields ['column', 'aggfunc'] to make it clearer what the arguments are. As usual, the aggregation can be a callable or a string alias.

You can now pass a tuple via keyword arguments. The tuples follow the format of (<colName>, <aggFunc>).

import pandas as pd

pd.__version__                                                                                                                            
# '0.25.0.dev0+840.g989f912ee'

# Setup
df = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
                   'height': [9.1, 6.0, 9.5, 34.0],
                   'weight': [7.9, 7.5, 9.9, 198.0]
})

df.groupby('kind').agg(
    max_height=('height', 'max'), min_weight=('weight', 'min'),)

      max_height  min_weight
kind                        
cat          9.5         7.9
dog         34.0         7.5

Alternatively, you can use pd.NamedAgg (essentially a namedtuple) which makes things more explicit.

df.groupby('kind').agg(
    max_height=pd.NamedAgg(column='height', aggfunc='max'), 
    min_weight=pd.NamedAgg(column='weight', aggfunc='min')
)

      max_height  min_weight
kind                        
cat          9.5         7.9
dog         34.0         7.5

It is even simpler for Series, just pass the aggfunc to a keyword argument.

df.groupby('kind')['height'].agg(max_height='max', min_height='min')    

      max_height  min_height
kind                        
cat          9.5         9.1
dog         34.0         6.0       

Lastly, if your column names aren't valid python identifiers, use a dictionary with unpacking:

df.groupby('kind')['height'].agg(**{'max height': 'max', ...})

Pandas < 0.25

In more recent versions of pandas leading upto 0.24, if using a dictionary for specifying column names for the aggregation output, you will get a FutureWarning:

df.groupby('dummy').agg({'returns': {'Mean': 'mean', 'Sum': 'sum'}})
# FutureWarning: using a dict with renaming is deprecated and will be removed 
# in a future version

Using a dictionary for renaming columns is deprecated in v0.20. On more recent versions of pandas, this can be specified more simply by passing a list of tuples. If specifying the functions this way, all functions for that column need to be specified as tuples of (name, function) pairs.

df.groupby("dummy").agg({'returns': [('op1', 'sum'), ('op2', 'mean')]})

        returns          
            op1       op2
dummy                    
1      0.328953  0.032895

Or,

df.groupby("dummy")['returns'].agg([('op1', 'sum'), ('op2', 'mean')])

            op1       op2
dummy                    
1      0.328953  0.032895

Would something like this work:

In [7]: df.groupby('dummy').returns.agg({'func1' : lambda x: x.sum(), 'func2' : lambda x: x.prod()})
Out[7]: 
              func2     func1
dummy                        
1     -4.263768e-16 -0.188565