
Filter pandas data frame with results of groupby

I have a large data frame (40M rows) and I want to filter its rows based on a groupby result: a row should be kept only if the aggregate value for its group meets a condition.

For example, here is some random data. The 'letter' column would actually have thousands of unique values:

     x   y   z  letter
0   47  86  30  e
1   58   9  28  b
2   96  59  42  a
3   79   6  45  e
4   77  80  37  d
5   66  91  35  d
6   96  31  52  d
7   56   8  26  e
8   78  96  14  a
9   22  60  13  e
10  75  82   9  d
11   5  54  29  c
12  83  31  40  e
13  37  70   2  c
14  53  67  66  a
15  76  33  78  d
16  64  67  81  b
17  23  94   1  d
18  10   1  31  e
19  52  11   3  d
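
For anyone who wants to reproduce the example, here is a minimal sketch that rebuilds the sample frame shown above. The variable name `df` matches the snippets in the question; the values are copied from the table, not generated randomly.

```python
import pandas as pd

# Rebuild the 20-row sample frame from the table above.
df = pd.DataFrame({
    "x": [47, 58, 96, 79, 77, 66, 96, 56, 78, 22,
          75, 5, 83, 37, 53, 76, 64, 23, 10, 52],
    "y": [86, 9, 59, 6, 80, 91, 31, 8, 96, 60,
          82, 54, 31, 70, 67, 33, 67, 94, 1, 11],
    "z": [30, 28, 42, 45, 37, 35, 52, 26, 14, 13,
          9, 29, 40, 2, 66, 78, 81, 1, 31, 3],
    # One letter per row, in row order 0..19.
    "letter": list("ebaedddeaedcecadbded"),
})
```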

Apply a groupby on the 'letter' column, and get the sum of column x for each letter:

>>> df.groupby('letter').x.sum()
a    227
b    122
c     42
d    465
e    297

Then, I sort to see the letters with the highest sum, and manually identify a threshold. In this example the threshold might be 200.

>>> df.groupby('letter').x.sum().reset_index().sort_values('x', ascending=False)
  letter    x
3      d  465
4      e  297
0      a  227
1      b  122
2      c   42

Here's where I am stuck. In the original dataframe, I want to keep the rows for every letter whose groupby sum of column 'x' is greater than 200, and drop all other rows. In this example, that would keep all the rows with d, e or a.

I was trying something like this but it doesn't work:

df.groupby('letter').x.sum().filter(lambda x: len(x) > 200)

And even if I filter the groupby object, how do I use it to filter the original dataframe?

laurakurup asked Jan 31 '17 02:01



1 Answer

You can use groupby transform to calculate the group sum of x for each row, which gives a boolean series from the condition that you can use to subset the original data frame:

df1 = df[df.x.groupby(df.letter).transform('sum') > 200]

df1.letter.unique()
# array(['e', 'a', 'd'], dtype=object)
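
To make the intermediate steps of the transform approach visible, here is a minimal sketch that splits the one-liner into parts. It rebuilds only the x and letter columns of the sample data; the names `group_sum`, `mask` and `filtered` are illustrative and not part of the original answer.

```python
import pandas as pd

# Only the two columns needed for the filter, copied from the sample data.
df = pd.DataFrame({
    "x": [47, 58, 96, 79, 77, 66, 96, 56, 78, 22,
          75, 5, 83, 37, 53, 76, 64, 23, 10, 52],
    "letter": list("ebaedddeaedcecadbded"),
})

# transform('sum') returns one value per original row (same index as df):
# each row receives the sum of x over its own letter group.
group_sum = df.groupby("letter")["x"].transform("sum")

# Row-aligned boolean mask: True where the row's group sum exceeds 200.
mask = group_sum > 200

# Boolean indexing keeps only the rows whose group passes the threshold.
filtered = df[mask]
```

Because `group_sum` is aligned with the rows of `df`, the mask can index the original frame directly, with no merge or lookup step.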

Or another option using groupby.filter:

df2 = df.groupby('letter').filter(lambda g: g.x.sum() > 200)

df2.letter.unique()
# array(['e', 'a', 'd'], dtype=object)
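
On the attempt in the question: `df.groupby('letter').x.sum()` returns a plain Series, so `.filter` there is `Series.filter`, which selects by index label rather than by a predicate, and `len(x) > 200` would test a length rather than the sum. A third option, sketched below as an assumption rather than as part of the original answer, reuses the aggregated sums directly and answers the "how do I use it to filter the original dataframe" part with `isin`. The names `sums`, `keep` and `df3` are illustrative.

```python
import pandas as pd

# Only the two columns needed for the filter, copied from the sample data.
df = pd.DataFrame({
    "x": [47, 58, 96, 79, 77, 66, 96, 56, 78, 22,
          75, 5, 83, 37, 53, 76, 64, 23, 10, 52],
    "letter": list("ebaedddeaedcecadbded"),
})

# One sum per letter (a Series indexed by letter).
sums = df.groupby("letter")["x"].sum()

# Index labels (letters) whose sum passes the threshold.
keep = sums[sums > 200].index

# Keep the original rows whose letter is among those labels.
df3 = df[df["letter"].isin(keep)]
```

This variant keeps the per-group sums around in `sums`, which may be convenient when the threshold is chosen by inspecting them first, as described in the question.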

Psidom answered Oct 21 '22 00:10