Filter pandas data frame with results of groupby

Tags: python, pandas, dataframe, filter, python-2.7

I have a large data frame (40M rows) and I want to filter out rows based on one column if the value meets a condition in a groupby object.

For example, here is some random data. The 'letter' column would actually have thousands of unique values:

     x   y   z  letter
0   47  86  30  e
1   58   9  28  b
2   96  59  42  a
3   79   6  45  e
4   77  80  37  d
5   66  91  35  d
6   96  31  52  d
7   56   8  26  e
8   78  96  14  a
9   22  60  13  e
10  75  82   9  d
11   5  54  29  c
12  83  31  40  e
13  37  70   2  c
14  53  67  66  a
15  76  33  78  d
16  64  67  81  b
17  23  94   1  d
18  10   1  31  e
19  52  11   3  d

Apply a groupby on the 'letter' column, and get the sum of column x for each letter:

>>> df.groupby('letter').x.sum()
a    227
b    122
c     42
d    465
e    297

Then, I sort to see the letters with the highest sum, and manually identify a threshold. In this example the threshold might be 200.

>>> df.groupby('letter').x.sum().reset_index().sort_values('x', ascending=False)
  letter    x
3      d  465
4      e  297
0      a  227
1      b  122
2      c   42

Here's where I am stuck. In the original dataframe, I want to keep letters if the groupby sum of column 'x' > 200, and drop the other rows. So in this example, it would keep all the rows with d, e or a.

I was trying something like this but it doesn't work:

df.groupby('letter').x.sum().filter(lambda x: len(x) > 200)

And even if I filter the groupby object, how do I use it to filter the original dataframe?


asked Jan 31 '17 02:01

laurakurup

1 Answer

You can use groupby with transform to compute the group sum of x for every row, which gives a boolean Series aligned with the original data frame that you can use to subset it:

df1 = df[df.x.groupby(df.letter).transform('sum') > 200]

df1.letter.unique()
# array(['e', 'a', 'd'], dtype=object)
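For anyone who wants to try this end to end, here is a minimal sketch that rebuilds the x and letter columns from the question's sample data and applies the transform-based mask. The intermediate names (group_sum, mask, kept) are only illustrative and are not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question (only the columns needed here).
df = pd.DataFrame({
    "x": [47, 58, 96, 79, 77, 66, 96, 56, 78, 22,
          75, 5, 83, 37, 53, 76, 64, 23, 10, 52],
    "letter": list("ebaedddeaedcecadbded"),
})

# transform('sum') returns one value per original row: the sum of x
# for that row's letter. The result shares df's index.
group_sum = df.groupby("letter")["x"].transform("sum")

# Boolean mask aligned with df, so it can index df directly.
mask = group_sum > 200
kept = df[mask]

print(sorted(kept["letter"].unique()))
```

Because the transformed Series has the same index as df, no merge or join is needed to map the group totals back onto the rows.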

Or another option using groupby.filter:

df2 = df.groupby('letter').filter(lambda g: g.x.sum() > 200)

df2.letter.unique()
# array(['e', 'a', 'd'], dtype=object)
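As for the attempt in the question: once .sum() has been called, the result is an ordinary Series rather than a groupby object, and Series.filter selects by index label, not by value, so it cannot apply the threshold. If you already have the aggregated sums and want to reuse them, one possible route (a sketch, not something from the original answer) is to pick the qualifying letters with boolean indexing and then filter the original frame with isin. The names sums, keep_letters and filtered are illustrative.

```python
import pandas as pd

# Sample data copied from the question (only the columns needed here).
df = pd.DataFrame({
    "x": [47, 58, 96, 79, 77, 66, 96, 56, 78, 22,
          75, 5, 83, 37, 53, 76, 64, 23, 10, 52],
    "letter": list("ebaedddeaedcecadbded"),
})

# Aggregated sums: one value per letter, indexed by letter.
sums = df.groupby("letter")["x"].sum()

# Boolean indexing on the values keeps only letters above the threshold;
# the index of the result holds the letters to keep.
keep_letters = sums[sums > 200].index

# isin maps that set of letters back onto the original rows.
filtered = df[df["letter"].isin(keep_letters)]

print(list(keep_letters))
```

On a frame with tens of millions of rows and thousands of groups, the transform and isin approaches will often be faster than groupby.filter with a Python lambda, since the lambda is called once per group, but it is worth timing all three on your own data.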


answered Oct 21 '22 00:10

Psidom
