
Aggregation in Pandas

  1. How can I perform aggregation with Pandas?
  2. No DataFrame after aggregation! What happened?
  3. How can I aggregate mainly string columns (to lists, tuples, strings with separator)?
  4. How can I aggregate counts?
  5. How can I create a new column filled by aggregated values?

I've seen these recurring questions asking about various facets of the pandas aggregation functionality. Most of the information about aggregation and its use cases is currently fragmented across dozens of badly worded, unsearchable posts. The aim here is to collate some of the more important points for posterity.

This Q&A is meant to be the next instalment in a series of helpful user-guides:

  • How to pivot a dataframe
  • Pandas concat
  • How do I operate on a DataFrame with a Series for every column?
  • Pandas Merging 101

Please note that this post is not meant to be a replacement for the documentation about aggregation and about groupby, so please read that as well!

jezrael asked Dec 14 '18 14:12


1 Answer

Question 1

How can I perform aggregation with Pandas?

Expanded aggregation documentation.

Aggregating functions are the ones that reduce the dimension of the returned objects. This means the output Series/DataFrame has fewer rows than the original, or the same number when every group contains a single row.
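As a rough illustration of that row-count behaviour, here is a minimal sketch. The toy frame and the column names key and val are made up for this example and do not come from the post:

```python
import pandas as pd

# Hypothetical toy data, not taken from the original post.
df = pd.DataFrame({'key': ['x', 'x', 'y', 'z'],
                   'val': [1, 2, 3, 4]})

# Three distinct keys give three output rows (fewer than the four input rows).
reduced = df.groupby('key')['val'].sum()

# Grouping by a column whose values are all unique keeps the row count unchanged.
same = df.groupby('val')['key'].count()

print(len(df), len(reduced), len(same))
```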

Some common aggregating functions are tabulated below:

    Function      Description
    mean()        Compute mean of groups
    sum()         Compute sum of group values
    size()        Compute group sizes
    count()       Compute count of group
    std()         Standard deviation of groups
    var()         Compute variance of groups
    sem()         Standard error of the mean of groups
    describe()    Generates descriptive statistics
    first()       Compute first of group values
    last()        Compute last of group values
    nth()         Take nth value, or a subset if n is a list
    min()         Compute min of group values
    max()         Compute max of group values

    import numpy as np
    import pandas as pd

    np.random.seed(123)

    df = pd.DataFrame({'A' : ['foo', 'foo', 'bar', 'foo', 'bar', 'foo'],
                       'B' : ['one', 'two', 'three','two', 'two', 'one'],
                       'C' : np.random.randint(5, size=6),
                       'D' : np.random.randint(5, size=6),
                       'E' : np.random.randint(5, size=6)})
    print (df)
         A      B  C  D  E
    0  foo    one  2  3  0
    1  foo    two  4  1  0
    2  bar  three  2  1  1
    3  foo    two  1  0  3
    4  bar    two  3  1  4
    5  foo    one  2  1  0

Aggregation by filtered columns and Cython implemented functions:

    df1 = df.groupby(['A', 'B'], as_index=False)['C'].sum()
    print (df1)
         A      B  C
    0  bar  three  2
    1  bar    two  3
    2  foo    one  4
    3  foo    two  5

If no columns are selected after groupby, the aggregate function is applied to every column that is not a grouping key (here, all columns except A and B):

    df2 = df.groupby(['A', 'B'], as_index=False).sum()
    print (df2)
         A      B  C  D  E
    0  bar  three  2  1  1
    1  bar    two  3  1  4
    2  foo    one  4  4  0
    3  foo    two  5  1  3

You can also specify only some columns used for aggregation in a list after the groupby function:

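    # Note the double brackets: a list of column names is required.
    # The tuple-style selection ['C','D'] is no longer supported by current pandas.
    df3 = df.groupby(['A', 'B'], as_index=False)[['C','D']].sum()
    print (df3)
         A      B  C  D
    0  bar  three  2  1
    1  bar    two  3  1
    2  foo    one  4  4
    3  foo    two  5  1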

The same results can be obtained with the function DataFrameGroupBy.agg:

    df1 = df.groupby(['A', 'B'], as_index=False)['C'].agg('sum')
    print (df1)
         A      B  C
    0  bar  three  2
    1  bar    two  3
    2  foo    one  4
    3  foo    two  5

    df2 = df.groupby(['A', 'B'], as_index=False).agg('sum')
    print (df2)
         A      B  C  D  E
    0  bar  three  2  1  1
    1  bar    two  3  1  4
    2  foo    one  4  4  0
    3  foo    two  5  1  3

To apply multiple functions to one column, pass a list of tuples, each holding the name of the new column and the aggregate function:

    df4 = (df.groupby(['A', 'B'])['C']
             .agg([('average','mean'),('total','sum')])
             .reset_index())
    print (df4)
         A      B  average  total
    0  bar  three      2.0      2
    1  bar    two      3.0      3
    2  foo    one      2.0      4
    3  foo    two      2.5      5

The same list of tuples can be passed without selecting a column, which applies every function to every non-grouping column:

    df5 = (df.groupby(['A', 'B'])
             .agg([('average','mean'),('total','sum')]))

    print (df5)
                    C             D             E
              average total average total average total
    A   B
    bar three     2.0     2     1.0     1     1.0     1
        two       3.0     3     1.0     1     4.0     4
    foo one       2.0     4     2.0     4     0.0     0
        two       2.5     5     0.5     1     1.5     3

The result then has a MultiIndex in the columns:

    print (df5.columns)
    MultiIndex(levels=[['C', 'D', 'E'], ['average', 'total']],
               labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])

To flatten the MultiIndex into ordinary column names, use map with join:

    df5.columns = df5.columns.map('_'.join)
    df5 = df5.reset_index()
    print (df5)
         A      B  C_average  C_total  D_average  D_total  E_average  E_total
    0  bar  three        2.0        2        1.0        1        1.0        1
    1  bar    two        3.0        3        1.0        1        4.0        4
    2  foo    one        2.0        4        2.0        4        0.0        0
    3  foo    two        2.5        5        0.5        1        1.5        3

Another solution is to pass a list of aggregate functions, flatten the MultiIndex, and then change the column names with str.replace:

    df5 = df.groupby(['A', 'B']).agg(['mean','sum'])

    df5.columns = (df5.columns.map('_'.join)
                      .str.replace('sum','total')
                      .str.replace('mean','average'))
    df5 = df5.reset_index()
    print (df5)
         A      B  C_average  C_total  D_average  D_total  E_average  E_total
    0  bar  three        2.0        2        1.0        1        1.0        1
    1  bar    two        3.0        3        1.0        1        4.0        4
    2  foo    one        2.0        4        2.0        4        0.0        0
    3  foo    two        2.5        5        0.5        1        1.5        3

To specify the aggregate function for each column separately, pass a dictionary:

    df6 = (df.groupby(['A', 'B'], as_index=False)
             .agg({'C':'sum','D':'mean'})
             .rename(columns={'C':'C_total', 'D':'D_average'}))
    print (df6)
         A      B  C_total  D_average
    0  bar  three        2        1.0
    1  bar    two        3        1.0
    2  foo    one        4        2.0
    3  foo    two        5        0.5
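The dictionary-plus-rename pattern above can also be written with named aggregation (keyword arguments of the form new_name=(column, function)), which pandas supports from version 0.25 onward. This is an alternative the original answer does not use, shown here as a sketch. The frame is rebuilt from the literal values printed earlier so the snippet runs on its own:

```python
import pandas as pd

# Same values as the sample frame printed above, written out literally.
df = pd.DataFrame({'A': ['foo', 'foo', 'bar', 'foo', 'bar', 'foo'],
                   'B': ['one', 'two', 'three', 'two', 'two', 'one'],
                   'C': [2, 4, 2, 1, 3, 2],
                   'D': [3, 1, 1, 0, 1, 1]})

# Each keyword names an output column; the tuple is (input column, function).
df6 = (df.groupby(['A', 'B'], as_index=False)
         .agg(C_total=('C', 'sum'), D_average=('D', 'mean')))
print(df6)
```

This aggregates and renames in one step, so no separate rename call is needed.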

You can pass a custom function too:

    def func(x):
        # sum of the first and last value in each group
        return x.iat[0] + x.iat[-1]

    df7 = (df.groupby(['A', 'B'], as_index=False)
             .agg({'C':'sum','D': func})
             .rename(columns={'C':'C_total', 'D':'D_sum_first_and_last'}))
    print (df7)
         A      B  C_total  D_sum_first_and_last
    0  bar  three        2                     2
    1  bar    two        3                     2
    2  foo    one        4                     4
    3  foo    two        5                     1

Question 2

No DataFrame after aggregation! What happened?

Aggregation by two or more columns:

    df1 = df.groupby(['A', 'B'])['C'].sum()
    print (df1)
    A    B
    bar  three    2
         two      3
    foo  one      4
         two      5
    Name: C, dtype: int32

First check the Index and type of a Pandas object:

    print (df1.index)
    MultiIndex(levels=[['bar', 'foo'], ['one', 'three', 'two']],
               labels=[[0, 0, 1, 1], [1, 2, 0, 2]],
               names=['A', 'B'])

    print (type(df1))
    <class 'pandas.core.series.Series'>

There are two ways to turn a Series with a MultiIndex into a DataFrame with regular columns:

  • add the parameter as_index=False:

    df1 = df.groupby(['A', 'B'], as_index=False)['C'].sum()
    print (df1)
         A      B  C
    0  bar  three  2
    1  bar    two  3
    2  foo    one  4
    3  foo    two  5

  • use Series.reset_index:

    df1 = df.groupby(['A', 'B'])['C'].sum().reset_index()
    print (df1)
         A      B  C
    0  bar  three  2
    1  bar    two  3
    2  foo    one  4
    3  foo    two  5

If you group by a single column:

    df2 = df.groupby('A')['C'].sum()
    print (df2)
    A
    bar    5
    foo    9
    Name: C, dtype: int32

... you get a Series with a plain Index:

    print (df2.index)
    Index(['bar', 'foo'], dtype='object', name='A')

    print (type(df2))
    <class 'pandas.core.series.Series'>

The solution is the same as for the MultiIndex Series:

    df2 = df.groupby('A', as_index=False)['C'].sum()
    print (df2)
         A  C
    0  bar  5
    1  foo  9

    df2 = df.groupby('A')['C'].sum().reset_index()
    print (df2)
         A  C
    0  bar  5
    1  foo  9
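If the goal is only to get a DataFrame rather than a Series, while keeping the group keys in the index, two further options exist that the answer does not cover. The sketch below uses a small made-up frame: selecting the column with double brackets keeps the result two-dimensional, and Series.to_frame converts an existing Series.

```python
import pandas as pd

# Hypothetical three-row frame for illustration only.
df = pd.DataFrame({'A': ['foo', 'foo', 'bar'],
                   'C': [2, 4, 2]})

s = df.groupby('A')['C'].sum()      # single brackets -> Series
d = df.groupby('A')[['C']].sum()    # double brackets -> DataFrame, A stays in the index
f = s.to_frame()                    # convert the Series afterwards

print(type(s).__name__, type(d).__name__, type(f).__name__)
```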

Question 3

How can I aggregate mainly string columns (to lists, tuples, strings with separator)?

    df = pd.DataFrame({'A' : ['a', 'c', 'b', 'b', 'a', 'c', 'b'],
                       'B' : ['one', 'two', 'three','two', 'two', 'one', 'three'],
                       'C' : ['three', 'one', 'two', 'two', 'three','two', 'one'],
                       'D' : [1,2,3,2,3,1,2]})
    print (df)
       A      B      C  D
    0  a    one  three  1
    1  c    two    one  2
    2  b  three    two  3
    3  b    two    two  2
    4  a    two  three  3
    5  c    one    two  1
    6  b  three    one  2

Instead of an aggregation function, you can pass list, tuple or set to collect each group's values into that container:

    df1 = df.groupby('A')['B'].agg(list).reset_index()
    print (df1)
       A                    B
    0  a           [one, two]
    1  b  [three, two, three]
    2  c           [two, one]

An alternative is to use GroupBy.apply:

    df1 = df.groupby('A')['B'].apply(list).reset_index()
    print (df1)
       A                    B
    0  a           [one, two]
    1  b  [three, two, three]
    2  c           [two, one]
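The examples above only show list. As a small sketch of the other two containers the text mentions, the snippet below rebuilds columns A and B of the sample frame and aggregates them with tuple and set. The variable names as_tuples and as_sets are chosen for this example. Note that set drops duplicates and has no guaranteed order.

```python
import pandas as pd

# Columns A and B of the sample frame, written out so the snippet is standalone.
df = pd.DataFrame({'A': ['a', 'c', 'b', 'b', 'a', 'c', 'b'],
                   'B': ['one', 'two', 'three', 'two', 'two', 'one', 'three']})

# tuple keeps order and duplicates; set keeps only the distinct values.
as_tuples = df.groupby('A')['B'].agg(tuple)
as_sets = df.groupby('A')['B'].agg(set)

print(as_tuples)
print(as_sets)
```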

For converting to strings with a separator, use .join only if it is a string column:

    df2 = df.groupby('A')['B'].agg(','.join).reset_index()
    print (df2)
       A                B
    0  a          one,two
    1  b  three,two,three
    2  c          two,one

If it is a numeric column, use a lambda function with astype for converting to strings:

    df3 = (df.groupby('A')['D']
             .agg(lambda x: ','.join(x.astype(str)))
             .reset_index())
    print (df3)
       A      D
    0  a    1,3
    1  b  3,2,2
    2  c    2,1

Another solution is converting to strings before groupby:

    df3 = (df.assign(D = df['D'].astype(str))
             .groupby('A')['D']
             .agg(','.join).reset_index())
    print (df3)
       A      D
    0  a    1,3
    1  b  3,2,2
    2  c    2,1

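To convert all columns, don't select any columns after groupby. In the pandas versions this answer was written for, the result has no column D because of the automatic exclusion of 'nuisance' columns: numeric columns cannot be joined as strings, so they are all dropped. (pandas 2.0 and later no longer drop such columns silently and should raise a TypeError here instead.)

    df4 = df.groupby('A').agg(','.join).reset_index()
    print (df4)
       A                B            C
    0  a          one,two  three,three
    1  b  three,two,three  two,two,one
    2  c          two,one      one,two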

So it's necessary to convert all columns into strings, and then get all columns:

    df5 = (df.groupby('A')
             .agg(lambda x: ','.join(x.astype(str)))
             .reset_index())
    print (df5)
       A                B            C      D
    0  a          one,two  three,three    1,3
    1  b  three,two,three  two,two,one  3,2,2
    2  c          two,one      one,two    2,1
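When the numeric column should stay numeric instead of being turned into a string, one option is to give each column its own function through a dictionary, as in Question 1. The following sketch assumes that joining B and summing D is the desired result, which is a choice made for this example rather than something the answer prescribes:

```python
import pandas as pd

# Columns A, B and D of the sample frame, written out so the snippet is standalone.
df = pd.DataFrame({'A': ['a', 'c', 'b', 'b', 'a', 'c', 'b'],
                   'B': ['one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'D': [1, 2, 3, 2, 3, 1, 2]})

# Join the string column, sum the numeric one.
out = (df.groupby('A')
         .agg({'B': ','.join, 'D': 'sum'})
         .reset_index())
print(out)
```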

Question 4

How can I aggregate counts?

    df = pd.DataFrame({'A' : ['a', 'c', 'b', 'b', 'a', 'c', 'b'],
                       'B' : ['one', 'two', 'three','two', 'two', 'one', 'three'],
                       'C' : ['three', np.nan, np.nan, 'two', 'three','two', 'one'],
                       'D' : [np.nan,2,3,2,3,np.nan,2]})
    print (df)
       A      B      C    D
    0  a    one  three  NaN
    1  c    two    NaN  2.0
    2  b  three    NaN  3.0
    3  b    two    two  2.0
    4  a    two  three  3.0
    5  c    one    two  NaN
    6  b  three    one  2.0

The function GroupBy.size returns the size of each group, missing values included:

    df1 = df.groupby('A').size().reset_index(name='COUNT')
    print (df1)
       A  COUNT
    0  a      2
    1  b      3
    2  c      2

The function GroupBy.count excludes missing values:

    df2 = df.groupby('A')['C'].count().reset_index(name='COUNT')
    print (df2)
       A  COUNT
    0  a      2
    1  b      2
    2  c      1

The same function can be applied to several columns at once to count the non-missing values in each:

    df3 = df.groupby('A').count().add_suffix('_COUNT').reset_index()
    print (df3)
       A  B_COUNT  C_COUNT  D_COUNT
    0  a        2        2        1
    1  b        3        2        3
    2  c        2        1        1
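To see the difference between size and count side by side, both can be requested for one column in a single agg call. The sketch below rebuilds columns A and C of the sample frame. The variable name both is chosen for this example.

```python
import numpy as np
import pandas as pd

# Columns A and C of the sample frame, written out so the snippet is standalone.
df = pd.DataFrame({'A': ['a', 'c', 'b', 'b', 'a', 'c', 'b'],
                   'C': ['three', np.nan, np.nan, 'two', 'three', 'two', 'one']})

# 'size' counts all rows in a group, 'count' only the non-missing values.
both = df.groupby('A')['C'].agg(['size', 'count'])
print(both)
```

Groups b and c each contain one NaN in column C, so their count is one less than their size.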

A related function is Series.value_counts. It returns a Series containing the counts of unique values in descending order, so that the first element is the most frequently occurring one. It excludes NaN values by default.

    df4 = (df['A'].value_counts()
                  .rename_axis('A')
                  .reset_index(name='COUNT'))
    print (df4)
       A  COUNT
    0  b      3
    1  a      2
    2  c      2

If you want the same output as groupby + size, add Series.sort_index:

    df5 = (df['A'].value_counts()
                  .sort_index()
                  .rename_axis('A')
                  .reset_index(name='COUNT'))
    print (df5)
       A  COUNT
    0  a      2
    1  b      3
    2  c      2
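Column A has no missing values, so the default exclusion of NaN is not visible in the examples above. A short sketch on the values of column C shows the effect of the dropna parameter of Series.value_counts. The variable names are chosen for this example.

```python
import numpy as np
import pandas as pd

# Values of column C from the sample frame (two of the seven are missing).
s = pd.Series(['three', np.nan, np.nan, 'two', 'three', 'two', 'one'], name='C')

default = s.value_counts()               # NaN excluded
with_nan = s.value_counts(dropna=False)  # NaN counted as its own category

print(default.sum(), with_nan.sum())
```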

Question 5

How can I create a new column filled by aggregated values?

Method GroupBy.transform returns an object that is indexed the same (same size) as the one being grouped.

See the Pandas documentation for more information.

    np.random.seed(123)

    df = pd.DataFrame({'A' : ['foo', 'foo', 'bar', 'foo', 'bar', 'foo'],
                       'B' : ['one', 'two', 'three','two', 'two', 'one'],
                       'C' : np.random.randint(5, size=6),
                       'D' : np.random.randint(5, size=6)})
    print (df)
         A      B  C  D
    0  foo    one  2  3
    1  foo    two  4  1
    2  bar  three  2  1
    3  foo    two  1  0
    4  bar    two  3  1
    5  foo    one  2  1

    df['C1'] = df.groupby('A')['C'].transform('sum')
    df['C2'] = df.groupby(['A','B'])['C'].transform('sum')

    # Double brackets are required to select several columns in current pandas.
    df[['C3','D3']] = df.groupby('A')[['C','D']].transform('sum')
    df[['C4','D4']] = df.groupby(['A','B'])[['C','D']].transform('sum')

    print (df)

         A      B  C  D  C1  C2  C3  D3  C4  D4
    0  foo    one  2  3   9   4   9   5   4   4
    1  foo    two  4  1   9   5   9   5   5   1
    2  bar  three  2  1   5   2   5   2   2   1
    3  foo    two  1  0   9   5   9   5   5   1
    4  bar    two  3  1   5   3   5   2   3   1
    5  foo    one  2  1   9   4   9   5   4   4
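Because the transformed result lines up row by row with the original frame, it can be combined directly with the original column. As a sketch of one typical use (the column name C_share is made up for this example), the snippet below computes each row's share of its group total:

```python
import pandas as pd

# Columns A and C of the sample frame, written out so the snippet is standalone.
df = pd.DataFrame({'A': ['foo', 'foo', 'bar', 'foo', 'bar', 'foo'],
                   'C': [2, 4, 2, 1, 3, 2]})

# transform broadcasts the group sum back to every row of the group.
group_sum = df.groupby('A')['C'].transform('sum')

# The result has one value per original row, so element-wise division works.
df['C_share'] = df['C'] / group_sum
print(df)
```

By contrast, df.groupby('A')['C'].sum() would return only one row per group and could not be assigned back as a column without a merge or map.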
jezrael answered Oct 11 '22 11:10