The trouble is this.
Lets say we have a pandas df that can be generated using the following:
month=['dec','dec','dec','jan','feb','feb','mar','mar']
category =['a','a','b','b','a','b','b','b']
sales=[1,10,2,5,12,4,3,1]
df = pd.DataFrame(list(zip(month,category,sales)),
columns =['month', 'cat','sales'])
print(df)
| month cat sales |
|--------------------|
| 0 dec a 1 |
| 1 dec a 10 |
| 2 dec b 2 |
| 3 jan b 5 |
| 4 feb a 12 |
| 5 feb b 4 |
| 6 mar b 3 |
| 7 mar b 1 |
then let's suppose we would like a count of each category by month.
so we go and do something like
df=df.groupby(['month','cat']).sales.sum().reset_index()
print(df)
| month cat sales |
|--------------------|
| 0 dec a 11 |
| 1 dec b 2 |
| 2 feb a 12 |
| 3 feb b 4 |
| 4 jan b 5 |
| 5 mar b 4 |
But what we'd like to see is:
| month cat sales |
|--------------------|
| 0 dec a 11 |
| 1 dec b 2 |
| 2 feb a 12 |
| 3 feb b 4 |
| 4 jan b 5 |
| 5 jan a 0 |
| 6 mar b 4 |
| 7 mar a 0 |
Where the difference is categories that did not show up in a particular month would still show up just with zero as their total.
It's probable this has been asked before, but I couldn't find it. If you point me in the direction of the question, we'll go ahead and delete this one.
Continuing from where you stopped, a combo of stack
and unstack
will give you your required output:
res = (
df.groupby(['month', 'cat'])
.sales.sum()
.unstack(fill_value=0) # Unstack and fill value for the null column
.stack() # Return to groupby form and reset
.reset_index(name='sales')
)
The output of res
:
>>> res
month cat sales
0 dec a 11
1 dec b 2
2 feb a 12
3 feb b 4
4 jan a 0
5 jan b 5
6 mar a 0
7 mar b 4
You can also work with categoricals and set observed
to False; this will ensure that all possible combinations are presented in the final output.
(df.astype({'month' : 'category',
'cat' : 'category'})
.groupby(['month', 'cat'],
as_index = False,
observed = False)
.sum(numeric_only = True)
)
month cat sales
0 dec a 11
1 dec b 2
2 feb a 12
3 feb b 4
4 jan a 0
5 jan b 5
6 mar a 0
7 mar b 4
Use MultiIndex
with reindex
as:
df=(
df.groupby(['month','cat']).sales.sum()
.reindex(pd.MultiIndex.from_product([df.month.unique(), df.cat.unique()],
names=['month', 'cat']), fill_value=0)
.reset_index()
)
print(df)
month cat sales
0 dec a 11
1 dec b 2
2 feb a 12
3 feb b 4
4 jan a 0
5 jan b 5
6 mar a 0
7 mar b 4
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With