Having grouped data, I want to drop from the results any group that contains only a single observation whose value is below a certain threshold.
Initial data:
import pandas as pd
df = pd.DataFrame(data={'Province' : ['ON','QC','BC','AL','AL','MN','ON'],
                        'City' : ['Toronto','Montreal','Vancouver','Calgary','Edmonton','Winnipeg','Windsor'],
                        'Sales' : [13,6,16,8,4,3,1]})
City Province Sales
0 Toronto ON 13
1 Montreal QC 6
2 Vancouver BC 16
3 Calgary AL 8
4 Edmonton AL 4
5 Winnipeg MN 3
6 Windsor ON 1
Now grouping the data:
df.groupby(['Province', 'City']).sum()
                    Sales
Province City
AL       Calgary        8
         Edmonton       4
BC       Vancouver     16
MN       Winnipeg       3
ON       Toronto       13
         Windsor        1
QC       Montreal       6
Now the part I can't figure out is how to drop provinces with only one city (or generally N observations) with the total sales less than 10. The expected output should be:
                    Sales
Province City
AL       Calgary        8
         Edmonton       4
BC       Vancouver     16
ON       Toronto       13
         Windsor        1
I.e. MN/Winnipeg and QC/Montreal are gone from the results. Ideally, they won't be completely gone but combined into a new group called 'Other', but this may be material for another question.
You can do it this way:
In [188]: df
Out[188]:
City Province Sales
0 Toronto ON 13
1 Montreal QC 6
2 Vancouver BC 16
3 Calgary AL 8
4 Edmonton AL 4
5 Winnipeg MN 3
6 Windsor ON 1
In [189]: g = df.groupby(['Province', 'City']).sum().reset_index()
In [190]: g
Out[190]:
Province City Sales
0 AL Calgary 8
1 AL Edmonton 4
2 BC Vancouver 16
3 MN Winnipeg 3
4 ON Toronto 13
5 ON Windsor 1
6 QC Montreal 6
Now we will create a mask for those 'provinces with more than one city':
In [191]: mask = g.groupby('Province').City.transform('count') > 1
In [192]: mask
Out[192]:
0 True
1 True
2 False
3 False
4 True
5 True
6 False
dtype: bool
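To see why `transform('count')` is used here rather than a plain `count()`: `transform` returns a result aligned with the original rows, so it can serve directly as a boolean mask. Below is a minimal sketch that rebuilds `g` from the question's data; the names `per_group` and `per_row` are only illustrative and not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    'Province': ['ON', 'QC', 'BC', 'AL', 'AL', 'MN', 'ON'],
    'City': ['Toronto', 'Montreal', 'Vancouver', 'Calgary',
             'Edmonton', 'Winnipeg', 'Windsor'],
    'Sales': [13, 6, 16, 8, 4, 3, 1],
})
g = df.groupby(['Province', 'City']).sum().reset_index()

# count() collapses to one value per province (indexed by Province),
# so it cannot be used to index the rows of g directly.
per_group = g.groupby('Province')['City'].count()

# transform('count') broadcasts each province's count back onto
# every row of g, keeping g's own index.
per_row = g.groupby('Province')['City'].transform('count')

# Comparing the row-aligned counts gives one True/False per row of g.
mask = per_row > 1
```
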
Then keep the rows that either pass the mask or have total sales greater than or equal to 10:
In [193]: g[(mask) | (g.Sales >= 10)]
Out[193]:
Province City Sales
0 AL Calgary 8
1 AL Edmonton 4
2 BC Vancouver 16
4 ON Toronto 13
5 ON Windsor 1
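The same kind of mask could also cover the 'Other' idea mentioned in the question: instead of dropping the rows that fail the condition, relabel their province and group again. This is only a sketch under that assumption; the name `keep` is illustrative, and the label 'Other' is taken from the question.

```python
import pandas as pd

df = pd.DataFrame({
    'Province': ['ON', 'QC', 'BC', 'AL', 'AL', 'MN', 'ON'],
    'City': ['Toronto', 'Montreal', 'Vancouver', 'Calgary',
             'Edmonton', 'Winnipeg', 'Windsor'],
    'Sales': [13, 6, 16, 8, 4, 3, 1],
})
g = df.groupby(['Province', 'City']).sum().reset_index()

# Rows to keep as they are: the province has more than one city,
# or the city's sales reach the threshold of 10.
keep = (g.groupby('Province')['City'].transform('count') > 1) | (g['Sales'] >= 10)

# Relabel the remaining rows instead of dropping them.
g.loc[~keep, 'Province'] = 'Other'

# Group again so the relabelled cities end up together under 'Other'.
result = g.groupby(['Province', 'City']).sum()
print(result)
```

With the sample data, Winnipeg and Montreal should land under 'Other' while the remaining provinces are unchanged.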
I wasn't satisfied with any of the answers given, so I kept chipping at this until I figured out the following solution:
In [72]: df
Out[72]:
City Province Sales
0 Toronto ON 13
1 Montreal QC 6
2 Vancouver BC 16
3 Calgary AL 8
4 Edmonton AL 4
5 Winnipeg MN 3
6 Windsor ON 1
In [73]: df.groupby(['Province', 'City']).sum().groupby(level=0).filter(lambda x: len(x) > 1 or x.Sales.sum() >= 10)
Out[73]:
                    Sales
Province City
AL       Calgary        8
         Edmonton       4
BC       Vancouver     16
ON       Toronto       13
         Windsor        1
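For the "generally N observations" case from the question, the `filter` approach can be wrapped in a small function. This is a sketch, not a definitive implementation: `drop_small_groups`, `min_rows` and `min_sales` are hypothetical names. It assumes a province is kept when it has at least `min_rows` cities or when its total sales reach `min_sales`, and the predicate returns a plain scalar for each group.

```python
import pandas as pd

df = pd.DataFrame({
    'Province': ['ON', 'QC', 'BC', 'AL', 'AL', 'MN', 'ON'],
    'City': ['Toronto', 'Montreal', 'Vancouver', 'Calgary',
             'Edmonton', 'Winnipeg', 'Windsor'],
    'Sales': [13, 6, 16, 8, 4, 3, 1],
})


def drop_small_groups(frame, min_rows=2, min_sales=10):
    """Hypothetical helper: keep a province if it has at least
    `min_rows` cities or its total sales reach `min_sales`."""
    grouped = frame.groupby(['Province', 'City']).sum()
    # level=0 groups the aggregated rows by Province; filter keeps or
    # drops each province as a whole based on the scalar predicate.
    return grouped.groupby(level=0).filter(
        lambda x: len(x) >= min_rows or x['Sales'].sum() >= min_sales
    )


result = drop_small_groups(df)
print(result)
```

With the defaults this should keep AL, BC and ON and drop MN and QC; raising `min_rows` to 3 would judge the two-city provinces by their combined sales instead.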