
pandas: filtering by group size and data value

Tags: python, pandas

After grouping my data, I want to drop from the results any group that contains only a single observation whose value is below a certain threshold.

Initial data:

import pandas as pd

df = pd.DataFrame(data={'Province': ['ON', 'QC', 'BC', 'AL', 'AL', 'MN', 'ON'],
                        'City': ['Toronto', 'Montreal', 'Vancouver', 'Calgary', 'Edmonton', 'Winnipeg', 'Windsor'],
                        'Sales': [13, 6, 16, 8, 4, 3, 1]})

        City Province  Sales
0    Toronto       ON     13
1   Montreal       QC      6
2  Vancouver       BC     16
3    Calgary       AL      8
4   Edmonton       AL      4
5   Winnipeg       MN      3
6    Windsor       ON      1

Now grouping the data:

df.groupby(['Province', 'City']).sum()

                    Sales
Province City
AL       Calgary        8
         Edmonton       4
BC       Vancouver     16
MN       Winnipeg       3
ON       Toronto       13
         Windsor        1
QC       Montreal       6

The part I can't figure out is how to drop provinces that have only one city (or, more generally, N observations) and total sales of less than 10. The expected output is:

                    Sales
Province City
AL       Calgary        8
         Edmonton       4
BC       Vancouver     16
ON       Toronto       13
         Windsor        1

I.e. MN/Winnipeg and QC/Montreal are gone from the results. Ideally, they won't be completely gone but combined into a new group called 'Other', but this may be material for another question.

Dmitry B. asked Mar 18 '16 23:03


2 Answers

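You can do it this way. First aggregate the sales per province and city, then reset the index so that Province and City become ordinary columns: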

In [188]: df
Out[188]:
        City Province  Sales
0    Toronto       ON     13
1   Montreal       QC      6
2  Vancouver       BC     16
3    Calgary       AL      8
4   Edmonton       AL      4
5   Winnipeg       MN      3
6    Windsor       ON      1

In [189]: g = df.groupby(['Province', 'City']).sum().reset_index()

In [190]: g
Out[190]:
  Province       City  Sales
0       AL    Calgary      8
1       AL   Edmonton      4
2       BC  Vancouver     16
3       MN   Winnipeg      3
4       ON    Toronto     13
5       ON    Windsor      1
6       QC   Montreal      6

Now create a boolean mask that is True for every row whose province has more than one city:

In [191]: mask = g.groupby('Province').City.transform('count') > 1

In [192]: mask
Out[192]:
0     True
1     True
2    False
3    False
4     True
5     True
6    False
dtype: bool
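
The mask relies on transform('count') returning one value per row of g, aligned to g's index, whereas a plain count() returns one value per province. A minimal sketch of the difference, with the aggregated frame g rebuilt by hand so the snippet runs on its own (the names per_group and per_row are illustrative only):

```python
import pandas as pd

# The aggregated frame from the step above, rebuilt by hand for a standalone example.
g = pd.DataFrame({
    'Province': ['AL', 'AL', 'BC', 'MN', 'ON', 'ON', 'QC'],
    'City': ['Calgary', 'Edmonton', 'Vancouver', 'Winnipeg', 'Toronto', 'Windsor', 'Montreal'],
    'Sales': [8, 4, 16, 3, 13, 1, 6],
})

# count() collapses each province to a single number, indexed by Province.
per_group = g.groupby('Province')['City'].count()

# transform('count') broadcasts that number back to every row, keeping g's index,
# so the result can be compared and used directly as a row mask on g.
per_row = g.groupby('Province')['City'].transform('count')

print(per_group)
print(per_row)
```

Because per_row shares g's index, `per_row > 1` can be combined with other row-wise conditions on g without any merging.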

Then keep the rows that either belong to a multi-city province or have total sales greater than or equal to 10:

In [193]: g[(mask) | (g.Sales >= 10)]
Out[193]:
  Province       City  Sales
0       AL    Calgary      8
1       AL   Edmonton      4
2       BC  Vancouver     16
4       ON    Toronto     13
5       ON    Windsor      1
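
The question also asks about the general case of N observations. The steps above could be wrapped in a small helper along these lines. This is only a sketch: the names drop_small_groups, min_cities and min_sales are hypothetical, and comparing the province total (rather than each city's sales) against the threshold is one reading of the question. For single-city provinces the two readings coincide.

```python
import pandas as pd

df = pd.DataFrame({
    'Province': ['ON', 'QC', 'BC', 'AL', 'AL', 'MN', 'ON'],
    'City': ['Toronto', 'Montreal', 'Vancouver', 'Calgary', 'Edmonton', 'Winnipeg', 'Windsor'],
    'Sales': [13, 6, 16, 8, 4, 3, 1],
})


def drop_small_groups(frame, min_cities=2, min_sales=10):
    """Hypothetical helper: keep a province if it has at least `min_cities`
    cities or if its total sales reach `min_sales`."""
    # Aggregate sales per (Province, City), keeping both keys as columns.
    g = frame.groupby(['Province', 'City'], as_index=False)['Sales'].sum()
    by_province = g.groupby('Province')
    # Row-aligned number of cities and total sales for each row's province.
    n_cities = by_province['City'].transform('count')
    province_total = by_province['Sales'].transform('sum')
    return g[(n_cities >= min_cities) | (province_total >= min_sales)]


result = drop_small_groups(df)
print(result)
```

With the default arguments this keeps the same five rows as the session above.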
MaxU - stop WAR against UA answered Sep 19 '22 04:09


I wasn't satisfied with any of the answers given, so I kept chipping at this until I figured out the following solution:

In [72]: df
Out[72]:
        City Province  Sales
0    Toronto       ON     13
1   Montreal       QC      6
2  Vancouver       BC     16
3    Calgary       AL      8
4   Edmonton       AL      4
5   Winnipeg       MN      3
6    Windsor       ON      1

In [73]: df.groupby(['Province', 'City']).sum().groupby(level=0).filter(lambda x: len(x) > 1 or x.Sales.sum() >= 10)
Out[73]:
                    Sales
Province City
AL       Calgary        8
         Edmonton       4
BC       Vancouver     16
ON       Toronto       13
         Windsor        1
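
As for the idea from the question of folding the dropped provinces into an 'Other' group instead of discarding them, one possible approach is to relabel those rows before the final grouping. This is a rough sketch, not a tested recipe for every dataset; the variable names n_cities and small are illustrative, and the threshold of 10 is taken from the question.

```python
import pandas as pd

df = pd.DataFrame({
    'Province': ['ON', 'QC', 'BC', 'AL', 'AL', 'MN', 'ON'],
    'City': ['Toronto', 'Montreal', 'Vancouver', 'Calgary', 'Edmonton', 'Winnipeg', 'Windsor'],
    'Sales': [13, 6, 16, 8, 4, 3, 1],
})

# Aggregate sales per (Province, City), keeping both keys as columns.
g = df.groupby(['Province', 'City'], as_index=False)['Sales'].sum()

# Rows whose province has a single city and whose sales are below the threshold.
n_cities = g.groupby('Province')['City'].transform('count')
small = (n_cities == 1) & (g['Sales'] < 10)

# Relabel those provinces as 'Other', then group again to rebuild the MultiIndex.
g.loc[small, 'Province'] = 'Other'
result = g.groupby(['Province', 'City']).sum()
print(result)
```

Here MN/Winnipeg and QC/Montreal end up under 'Other' rather than disappearing, and the remaining provinces are unchanged.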
Dmitry B. answered Sep 19 '22 04:09