list
s, tuple
s, strings with separator
)?I've seen these recurring questions asking about various faces of the pandas aggregate functionality. Most of the information regarding aggregation and its various use cases today is fragmented across dozens of badly worded, unsearchable posts. The aim here is to collate some of the more important points for posterity.
This Q&A is meant to be the next instalment in a series of helpful user-guides:
Please note that this post is not meant to be a replacement for the documentation about aggregation and about groupby, so please read that as well!
Expanded aggregation documentation.
Aggregating functions are the ones that reduce the dimension of the returned objects. It means output Series/DataFrame have less or same rows like original.
Some common aggregating functions are tabulated below:
Function Description mean() Compute mean of groups sum() Compute sum of group values size() Compute group sizes count() Compute count of group std() Standard deviation of groups var() Compute variance of groups sem() Standard error of the mean of groups describe() Generates descriptive statistics first() Compute first of group values last() Compute last of group values nth() Take nth value, or a subset if n is a list min() Compute min of group values max() Compute max of group values
np.random.seed(123) df = pd.DataFrame({'A' : ['foo', 'foo', 'bar', 'foo', 'bar', 'foo'], 'B' : ['one', 'two', 'three','two', 'two', 'one'], 'C' : np.random.randint(5, size=6), 'D' : np.random.randint(5, size=6), 'E' : np.random.randint(5, size=6)}) print (df) A B C D E 0 foo one 2 3 0 1 foo two 4 1 0 2 bar three 2 1 1 3 foo two 1 0 3 4 bar two 3 1 4 5 foo one 2 1 0
Aggregation by filtered columns and Cython implemented functions:
df1 = df.groupby(['A', 'B'], as_index=False)['C'].sum() print (df1) A B C 0 bar three 2 1 bar two 3 2 foo one 4 3 foo two 5
An aggregate function is used for all columns without being specified in the groupby
function, here the A, B
columns:
df2 = df.groupby(['A', 'B'], as_index=False).sum() print (df2) A B C D E 0 bar three 2 1 1 1 bar two 3 1 4 2 foo one 4 4 0 3 foo two 5 1 3
You can also specify only some columns used for aggregation in a list after the groupby
function:
df3 = df.groupby(['A', 'B'], as_index=False)['C','D'].sum() print (df3) A B C D 0 bar three 2 1 1 bar two 3 1 2 foo one 4 4 3 foo two 5 1
Same results by using function DataFrameGroupBy.agg
:
df1 = df.groupby(['A', 'B'], as_index=False)['C'].agg('sum') print (df1) A B C 0 bar three 2 1 bar two 3 2 foo one 4 3 foo two 5 df2 = df.groupby(['A', 'B'], as_index=False).agg('sum') print (df2) A B C D E 0 bar three 2 1 1 1 bar two 3 1 4 2 foo one 4 4 0 3 foo two 5 1 3
For multiple functions applied for one column use a list of tuple
s - names of new columns and aggregated functions:
df4 = (df.groupby(['A', 'B'])['C'] .agg([('average','mean'),('total','sum')]) .reset_index()) print (df4) A B average total 0 bar three 2.0 2 1 bar two 3.0 3 2 foo one 2.0 4 3 foo two 2.5 5
If want to pass multiple functions is possible pass list
of tuple
s:
df5 = (df.groupby(['A', 'B']) .agg([('average','mean'),('total','sum')])) print (df5) C D E average total average total average total A B bar three 2.0 2 1.0 1 1.0 1 two 3.0 3 1.0 1 4.0 4 foo one 2.0 4 2.0 4 0.0 0 two 2.5 5 0.5 1 1.5 3
Then get MultiIndex
in columns:
print (df5.columns) MultiIndex(levels=[['C', 'D', 'E'], ['average', 'total']], labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])
And for converting to columns, flattening MultiIndex
use map
with join
:
df5.columns = df5.columns.map('_'.join) df5 = df5.reset_index() print (df5) A B C_average C_total D_average D_total E_average E_total 0 bar three 2.0 2 1.0 1 1.0 1 1 bar two 3.0 3 1.0 1 4.0 4 2 foo one 2.0 4 2.0 4 0.0 0 3 foo two 2.5 5 0.5 1 1.5 3
Another solution is pass list of aggregate functions, then flatten MultiIndex
and for another columns names use str.replace
:
df5 = df.groupby(['A', 'B']).agg(['mean','sum']) df5.columns = (df5.columns.map('_'.join) .str.replace('sum','total') .str.replace('mean','average')) df5 = df5.reset_index() print (df5) A B C_average C_total D_average D_total E_average E_total 0 bar three 2.0 2 1.0 1 1.0 1 1 bar two 3.0 3 1.0 1 4.0 4 2 foo one 2.0 4 2.0 4 0.0 0 3 foo two 2.5 5 0.5 1 1.5 3
If want specified each column with aggregated function separately pass dictionary
:
df6 = (df.groupby(['A', 'B'], as_index=False) .agg({'C':'sum','D':'mean'}) .rename(columns={'C':'C_total', 'D':'D_average'})) print (df6) A B C_total D_average 0 bar three 2 1.0 1 bar two 3 1.0 2 foo one 4 2.0 3 foo two 5 0.5
You can pass custom function too:
def func(x): return x.iat[0] + x.iat[-1] df7 = (df.groupby(['A', 'B'], as_index=False) .agg({'C':'sum','D': func}) .rename(columns={'C':'C_total', 'D':'D_sum_first_and_last'})) print (df7) A B C_total D_sum_first_and_last 0 bar three 2 2 1 bar two 3 2 2 foo one 4 4 3 foo two 5 1
Aggregation by two or more columns:
df1 = df.groupby(['A', 'B'])['C'].sum() print (df1) A B bar three 2 two 3 foo one 4 two 5 Name: C, dtype: int32
First check the Index
and type
of a Pandas object:
print (df1.index) MultiIndex(levels=[['bar', 'foo'], ['one', 'three', 'two']], labels=[[0, 0, 1, 1], [1, 2, 0, 2]], names=['A', 'B']) print (type(df1)) <class 'pandas.core.series.Series'>
There are two solutions for how to get MultiIndex Series
to columns:
as_index=False
df1 = df.groupby(['A', 'B'], as_index=False)['C'].sum() print (df1) A B C 0 bar three 2 1 bar two 3 2 foo one 4 3 foo two 5
Series.reset_index
:df1 = df.groupby(['A', 'B'])['C'].sum().reset_index() print (df1) A B C 0 bar three 2 1 bar two 3 2 foo one 4 3 foo two 5
If group by one column:
df2 = df.groupby('A')['C'].sum() print (df2) A bar 5 foo 9 Name: C, dtype: int32
... get Series
with Index
:
print (df2.index) Index(['bar', 'foo'], dtype='object', name='A') print (type(df2)) <class 'pandas.core.series.Series'>
And the solution is the same like in the MultiIndex Series
:
df2 = df.groupby('A', as_index=False)['C'].sum() print (df2) A C 0 bar 5 1 foo 9 df2 = df.groupby('A')['C'].sum().reset_index() print (df2) A C 0 bar 5 1 foo 9
list
s, tuple
s, strings with separator
)? df = pd.DataFrame({'A' : ['a', 'c', 'b', 'b', 'a', 'c', 'b'], 'B' : ['one', 'two', 'three','two', 'two', 'one', 'three'], 'C' : ['three', 'one', 'two', 'two', 'three','two', 'one'], 'D' : [1,2,3,2,3,1,2]}) print (df) A B C D 0 a one three 1 1 c two one 2 2 b three two 3 3 b two two 2 4 a two three 3 5 c one two 1 6 b three one 2
Instead of an aggregation function, it is possible to pass list
, tuple
, set
for converting the column:
df1 = df.groupby('A')['B'].agg(list).reset_index() print (df1) A B 0 a [one, two] 1 b [three, two, three] 2 c [two, one]
An alternative is use GroupBy.apply
:
df1 = df.groupby('A')['B'].apply(list).reset_index() print (df1) A B 0 a [one, two] 1 b [three, two, three] 2 c [two, one]
For converting to strings with a separator, use .join
only if it is a string column:
df2 = df.groupby('A')['B'].agg(','.join).reset_index() print (df2) A B 0 a one,two 1 b three,two,three 2 c two,one
If it is a numeric column, use a lambda function with astype
for converting to string
s:
df3 = (df.groupby('A')['D'] .agg(lambda x: ','.join(x.astype(str))) .reset_index()) print (df3) A D 0 a 1,3 1 b 3,2,2 2 c 2,1
Another solution is converting to strings before groupby
:
df3 = (df.assign(D = df['D'].astype(str)) .groupby('A')['D'] .agg(','.join).reset_index()) print (df3) A D 0 a 1,3 1 b 3,2,2 2 c 2,1
For converting all columns, don't pass a list of column(s) after groupby
. There isn't any column D
, because automatic exclusion of 'nuisance' columns. It means all numeric columns are excluded.
df4 = df.groupby('A').agg(','.join).reset_index() print (df4) A B C 0 a one,two three,three 1 b three,two,three two,two,one 2 c two,one one,two
So it's necessary to convert all columns into strings, and then get all columns:
df5 = (df.groupby('A') .agg(lambda x: ','.join(x.astype(str))) .reset_index()) print (df5) A B C D 0 a one,two three,three 1,3 1 b three,two,three two,two,one 3,2,2 2 c two,one one,two 2,1
df = pd.DataFrame({'A' : ['a', 'c', 'b', 'b', 'a', 'c', 'b'], 'B' : ['one', 'two', 'three','two', 'two', 'one', 'three'], 'C' : ['three', np.nan, np.nan, 'two', 'three','two', 'one'], 'D' : [np.nan,2,3,2,3,np.nan,2]}) print (df) A B C D 0 a one three NaN 1 c two NaN 2.0 2 b three NaN 3.0 3 b two two 2.0 4 a two three 3.0 5 c one two NaN 6 b three one 2.0
Function GroupBy.size
for size
of each group:
df1 = df.groupby('A').size().reset_index(name='COUNT') print (df1) A COUNT 0 a 2 1 b 3 2 c 2
Function GroupBy.count
excludes missing values:
df2 = df.groupby('A')['C'].count().reset_index(name='COUNT') print (df2) A COUNT 0 a 2 1 b 2 2 c 1
This function should be used for multiple columns for counting non-missing values:
df3 = df.groupby('A').count().add_suffix('_COUNT').reset_index() print (df3) A B_COUNT C_COUNT D_COUNT 0 a 2 2 1 1 b 3 2 3 2 c 2 1 1
A related function is Series.value_counts
. It returns the size of the object containing counts of unique values in descending order, so that the first element is the most frequently-occurring element. It excludes NaN
s values by default.
df4 = (df['A'].value_counts() .rename_axis('A') .reset_index(name='COUNT')) print (df4) A COUNT 0 b 3 1 a 2 2 c 2
If you want same output like using function groupby
+ size
, add Series.sort_index
:
df5 = (df['A'].value_counts() .sort_index() .rename_axis('A') .reset_index(name='COUNT')) print (df5) A COUNT 0 a 2 1 b 3 2 c 2
Method GroupBy.transform
returns an object that is indexed the same (same size) as the one being grouped.
See the Pandas documentation for more information.
np.random.seed(123) df = pd.DataFrame({'A' : ['foo', 'foo', 'bar', 'foo', 'bar', 'foo'], 'B' : ['one', 'two', 'three','two', 'two', 'one'], 'C' : np.random.randint(5, size=6), 'D' : np.random.randint(5, size=6)}) print (df) A B C D 0 foo one 2 3 1 foo two 4 1 2 bar three 2 1 3 foo two 1 0 4 bar two 3 1 5 foo one 2 1 df['C1'] = df.groupby('A')['C'].transform('sum') df['C2'] = df.groupby(['A','B'])['C'].transform('sum') df[['C3','D3']] = df.groupby('A')['C','D'].transform('sum') df[['C4','D4']] = df.groupby(['A','B'])['C','D'].transform('sum') print (df) A B C D C1 C2 C3 D3 C4 D4 0 foo one 2 3 9 4 9 5 4 4 1 foo two 4 1 9 5 9 5 5 1 2 bar three 2 1 5 2 5 2 2 1 3 foo two 1 0 9 5 9 5 5 1 4 bar two 3 1 5 3 5 2 3 1 5 foo one 2 1 9 4 9 5 4 4
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With