I have a dataframe with lots of rows. Sometimes are values are one ofs and not very useful for my purpose.
How can I remove all the rows from where columns 2 and 3's value doesn't appear more than 5 times?
df input
Col1 Col2 Col3 Col4
1 apple tomato banana
1 apple potato banana
1 apple tomato banana
1 apple tomato banana
1 apple tomato banana
1 apple tomato banana
1 grape tomato banana
1 pear tomato banana
1 lemon tomato banana
output
Col1 Col2 Col3 Col4
1 apple tomato banana
1 apple tomato banana
1 apple tomato banana
1 apple tomato banana
1 apple tomato banana
Global Counts
Use stack
+ value_counts
+ replace
-
v = df[['Col2', 'Col3']]
df[v.replace(v.stack().value_counts()).gt(5).all(1)]
Col1 Col2 Col3 Col4
0 1 apple tomato banana
2 1 apple tomato banana
3 1 apple tomato banana
4 1 apple tomato banana
5 1 apple tomato banana
(Update)
Columnwise Counts
Call apply
with pd.Series.value_counts
on your columns of interest, and filter in the same manner as before -
v = df[['Col2', 'Col3']]
df[v.replace(v.apply(pd.Series.value_counts)).gt(5).all(1)]
Col1 Col2 Col3 Col4
0 1 apple tomato banana
2 1 apple tomato banana
3 1 apple tomato banana
4 1 apple tomato banana
5 1 apple tomato banana
Details
Use value_counts
to count values in your dataframe -
c = v.apply(pd.Series.value_counts)
c
Col2 Col3
apple 6.0 NaN
grape 1.0 NaN
lemon 1.0 NaN
pear 1.0 NaN
potato NaN 1.0
tomato NaN 8.0
Call replace
, to replace values in the DataFrame with their counts -
i = v.replace(c)
i
Col2 Col3
0 6 8
1 6 1
2 6 8
3 6 8
4 6 8
5 6 8
6 1 8
7 1 8
8 1 8
From that point,
m = i.gt(5).all(1)
0 True
1 False
2 True
3 True
4 True
5 True
6 False
7 False
8 False
dtype: bool
Use the mask to index df
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With