I have a large data frame (40M rows) and I want to filter out rows based on one column if the value meets a condition in a groupby object.
For example, here is some random data. The 'letter' column would actually have thousands of unique values:
x y z letter
0 47 86 30 e
1 58 9 28 b
2 96 59 42 a
3 79 6 45 e
4 77 80 37 d
5 66 91 35 d
6 96 31 52 d
7 56 8 26 e
8 78 96 14 a
9 22 60 13 e
10 75 82 9 d
11 5 54 29 c
12 83 31 40 e
13 37 70 2 c
14 53 67 66 a
15 76 33 78 d
16 64 67 81 b
17 23 94 1 d
18 10 1 31 e
19 52 11 3 d
Apply a groupby on the 'letter' column, and get the sum of column x for each letter:
df.groupby('letter').x.sum()
>>> a 227
b 122
c 42
d 465
e 297
Then, I sort to see the letters with the highest sum, and manually identify a threshold. In this example the threshold might be 200.
df.groupby('letter').x.sum().reset_index().sort_values('x', ascending=False)
>>> letter x
3 d 465
4 e 297
0 a 227
1 b 122
2 c 42
Here's where I am stuck. In the original dataframe, I want to keep letters if the groupby sum of column 'x' > 200, and drop the other rows. So in this example, it would keep all the rows with d, e or a.
I was trying something like this but it doesn't work:
df.groupby('letter').x.sum().filter(lambda x: len(x) > 200)
And even if I filter the groupby object, how do I use it to filter the original dataframe?
Note that the name argument within reset_index () specifies the name for the new column produced by GroupBy. We can also confirm that the result is indeed a pandas DataFrame: #display object type of df_out type(df_out) pandas.core.frame.DataFrame Note: You can find the complete documentation for the GroupBy operation in pandas here.
Pandas Filter : filter () The pandas filter function helps in generating a subset of the dataframe rows or columns according to the specified index labels.
DataFrameGroupBy.filter(func, dropna=True, *args, **kwargs)[source]¶ Return a copy of a DataFrame excluding filtered elements. Elements from groups are filtered if they do not satisfy the boolean criterion specified by func.
In this example, regex is used along with the pandas filter function. Here, with the help of regex, we are able to fetch the values of column (s) which have column name that has “o” at the end. The ‘$’ is used as a wildcard suggesting that column name should end with “o”.
You can use groupby
transform
to calculate a the sum of x for each row and create a logical series with the condition with which you can do the subset:
df1 = df[df.x.groupby(df.letter).transform('sum') > 200]
df1.letter.unique()
# array(['e', 'a', 'd'], dtype=object)
Or another option using groupby.filter
:
df2 = df.groupby('letter').filter(lambda g: g.x.sum() > 200)
df2.letter.unique()
# array(['e', 'a', 'd'], dtype=object)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With