Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Return aggregate for all unique in a group

The trouble is this.

Lets say we have a pandas df that can be generated using the following:

month=['dec','dec','dec','jan','feb','feb','mar','mar']
category =['a','a','b','b','a','b','b','b']
sales=[1,10,2,5,12,4,3,1]

df = pd.DataFrame(list(zip(month,category,sales)), 
                   columns =['month', 'cat','sales']) 

print(df)

| month cat  sales   |
|--------------------|
| 0   dec   a      1 |
| 1   dec   a     10 |
| 2   dec   b      2 |
| 3   jan   b      5 |
| 4   feb   a     12 |
| 5   feb   b      4 |
| 6   mar   b      3 |
| 7   mar   b      1 |

then let's suppose we would like a count of each category by month.

so we go and do something like

df=df.groupby(['month','cat']).sales.sum().reset_index()
print(df)
|  month cat  sales  |
|--------------------|
| 0   dec   a     11 |
| 1   dec   b      2 |
| 2   feb   a     12 |
| 3   feb   b      4 |
| 4   jan   b      5 |
| 5   mar   b      4 |

But what we'd like to see is:

|  month cat  sales  |
|--------------------|
| 0   dec   a     11 |
| 1   dec   b      2 |
| 2   feb   a     12 |
| 3   feb   b      4 |
| 4   jan   b      5 |
| 5   jan   a      0 |
| 6   mar   b      4 |
| 7   mar   a      0 |

Where the difference is categories that did not show up in a particular month would still show up just with zero as their total.

It's probable this has been asked before, but I couldn't find it. If you point me in the direction of the question, we'll go ahead and delete this one.

like image 872
born_naked Avatar asked May 01 '20 02:05

born_naked


2 Answers

Continuing from where you stopped, a combo of stack and unstack will give you your required output:

res = (
    df.groupby(['month', 'cat'])
    .sales.sum()
    .unstack(fill_value=0)  # Unstack and fill value for the null column
    .stack()  # Return to groupby form and reset
    .reset_index(name='sales')
)

The output of res:

>>> res

  month cat sales
0   dec a   11
1   dec b   2
2   feb a   12
3   feb b   4
4   jan a   0
5   jan b   5
6   mar a   0
7   mar b   4

You can also work with categoricals and set observed to False; this will ensure that all possible combinations are presented in the final output.

(df.astype({'month' : 'category',
            'cat' : 'category'})
   .groupby(['month', 'cat'], 
             as_index = False, 
             observed = False)
   .sum(numeric_only = True)
)

  month cat  sales
0   dec   a     11
1   dec   b      2
2   feb   a     12
3   feb   b      4
4   jan   a      0
5   jan   b      5
6   mar   a      0
7   mar   b      4
like image 151
sammywemmy Avatar answered Oct 06 '22 22:10

sammywemmy


Use MultiIndex with reindex as:

df=(
    df.groupby(['month','cat']).sales.sum()
    .reindex(pd.MultiIndex.from_product([df.month.unique(), df.cat.unique()], 
                                   names=['month', 'cat']), fill_value=0)
    .reset_index()
)

print(df)
  month cat  sales
0   dec   a     11
1   dec   b      2
2   feb   a     12
3   feb   b      4
4   jan   a      0
5   jan   b      5
6   mar   a      0
7   mar   b      4
like image 6
Space Impact Avatar answered Oct 06 '22 22:10

Space Impact