Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to filter based on a boolean in a column and return a groupby obj

Tags:

python

pandas

I am trying to return a groupby object that contains only items with 3 or more entries for the field I am grouping on. For instance, if I have the following object and I group on 'Letters':

Letters Numbers Items Bool
A       1       lamp  1
B       2       glass 1
B       2       table 1
C       5       pic   0
B       3       glass 0
C       4       bottle 1
C       2       horse 0

Then my code would only return a groupby object with b

I have tried the following:

old_df = a_df.groupby('Letters')
new_df = old_df.filter(len(old_df['Letters'])>2)

I've also tried with some lambda stuff mentioned here: filtering grouped df in pandas

Thanks!


So if I wanted to return a vector with the sum of all the values in Bool per letter where the number of entries per letter exceeds three, how would I do it? With the data provided, this would be [2,1] (2 for B and 1 for C)

like image 517
goldisfine Avatar asked Sep 03 '25 03:09

goldisfine


1 Answers

I think you're misunderstanding the output of groupby:

In [4]: df
Out[4]:
  Letters  Numbers  Items  Bool
0       A        1   lamp     1
1       B        2  glass     1
2       B        2  table     1
3       C        5    pic     0
4       B        3  glass     0

In [6]: df.groupby('Letters').filter(lambda x: len(x) > 2)
Out[6]:
  Letters  Numbers  Items  Bool
1       B        2  glass     1
2       B        2  table     1
4       B        3  glass     0

The object that results from a groupby operation is an iterable object, and its "iterands" are tuples of the group label and the subset of the original DataFrame corresponding to the group. For example,

In [9]: for name, subset in df.groupby('Letters'):
   ...:     print(name)
   ...:     print(subset)
   ...:     print('\n')
   ...:
A
  Letters  Numbers Items  Bool
0       A        1  lamp     1


B
  Letters  Numbers  Items  Bool
1       B        2  glass     1
2       B        2  table     1
4       B        3  glass     0


C
  Letters  Numbers Items  Bool
3       C        5   pic     0

To sum the values of Bool by Letter where the number of members in each group is greater than 2:

In [39]: df.groupby('Letters').filter(lambda x: len(x) > 2).groupby('Letters').Bool.sum()
Out[39]:
Letters
B          2
C          1
Name: Bool, dtype: int64
like image 99
Phillip Cloud Avatar answered Sep 05 '25 20:09

Phillip Cloud