Including missing combinations of values in a pandas groupby aggregation

Tags: python, pandas

Problem

I want the output of a pandas groupby aggregation to include every possible combination of the grouping values, even combinations that never occur in the data.

Example

The example pandas DataFrame has three columns: User, Code, and Subtotal.

import pandas as pd
example_df = pd.DataFrame([['a', 1, 1], ['a', 2, 1], ['b', 1, 1], ['b', 2, 1], ['c', 1, 1], ['c', 1, 1]], columns=['User', 'Code', 'Subtotal'])

I'd like to group on User and Code and get a subtotal for each combination of User and Code.

print(example_df.groupby(['User', 'Code']).Subtotal.sum().reset_index())

The output I get is:

  User   Code   Subtotal
0    a      1          1
1    a      2          1
2    b      1          1
3    b      2          1
4    c      1          2

How can I include the missing combination User=='c' and Code==2 in the table, even though it doesn't exist in example_df?

Preferred output

Below is the preferred output, with a zero line for the User=='c' and Code==2 combination.

  User   Code   Subtotal
0    a      1          1
1    a      2          1
2    b      1          1
3    b      2          1
4    c      1          2
5    c      2          0

Asked by ajrwhite on Mar 17 '17 at 10:03


1 Answer

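You can use unstack with fill_value=0, followed by stack. unstack pivots Code into columns and fills the missing User/Code combinations with 0; stack then converts the result back to long format: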

print(example_df.groupby(['User', 'Code']).Subtotal.sum()
                .unstack(fill_value=0)
                .stack()
                .reset_index(name='Subtotal'))
  User  Code  Subtotal
0    a     1         1
1    a     2         1
2    b     1         1
3    b     2         1
4    c     1         2
5    c     2         0
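
To see why the round trip fills the gap, it may help to look at the intermediate wide table. The sketch below splits the chain into steps; the variable names totals, wide, and tidy are illustrative and not part of the original answer.

```python
import pandas as pd

example_df = pd.DataFrame(
    [['a', 1, 1], ['a', 2, 1], ['b', 1, 1], ['b', 2, 1], ['c', 1, 1], ['c', 1, 1]],
    columns=['User', 'Code', 'Subtotal'],
)

# Series indexed by (User, Code); the pair ('c', 2) is absent here.
totals = example_df.groupby(['User', 'Code'])['Subtotal'].sum()

# Pivot Code into columns. The cell for User 'c', Code 2 has no source
# value, so fill_value=0 is written there instead of NaN.
wide = totals.unstack(fill_value=0)
print(wide)

# Pivot Code back into the index, then flatten to ordinary columns.
tidy = wide.stack().reset_index(name='Subtotal')
print(tidy)
```

Because the wide table has one cell for every User row and Code column, stacking it back yields one row per combination.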

Another solution is to reindex with a MultiIndex created by MultiIndex.from_product:

df = example_df.groupby(['User', 'Code']).Subtotal.sum()
mux = pd.MultiIndex.from_product(df.index.levels, names=['User','Code'])
print(mux)
MultiIndex(levels=[['a', 'b', 'c'], [1, 2]],
           labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]],
           names=['User', 'Code'])

print(df.reindex(mux, fill_value=0).reset_index(name='Subtotal'))
  User  Code  Subtotal
0    a     1         1
1    a     2         1
2    b     1         1
3    b     2         1
4    c     1         2
5    c     2         0
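
If this pattern is needed in several places, the reindex approach could be wrapped in a small function. The following is only a sketch: sum_all_combinations is a hypothetical helper name, not a pandas API, and it assumes the full set of values for each key is whatever appears in that column of the input.

```python
import pandas as pd

def sum_all_combinations(frame, keys, value):
    """Hypothetical helper: sum `value` for every combination of `keys`,
    returning 0 for combinations that do not occur in `frame`."""
    totals = frame.groupby(keys)[value].sum()
    # Cartesian product of the distinct values observed in each key column.
    full_index = pd.MultiIndex.from_product(
        [sorted(frame[key].unique()) for key in keys], names=keys
    )
    return totals.reindex(full_index, fill_value=0).reset_index(name=value)

example_df = pd.DataFrame(
    [['a', 1, 1], ['a', 2, 1], ['b', 1, 1], ['b', 2, 1], ['c', 1, 1], ['c', 1, 1]],
    columns=['User', 'Code', 'Subtotal'],
)

result = sum_all_combinations(example_df, ['User', 'Code'], 'Subtotal')
print(result)
```

Note that this, like df.index.levels in the answer above, only covers values that occur somewhere in the data. A Code that no user has at all would have to be passed to from_product explicitly.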

Answered by jezrael on Nov 05 '22 at 16:11