I have a dataframe with lots of rows. Sometimes the values are one-offs and not very useful for my purpose.
How can I remove all the rows where the value in column 2 or column 3 doesn't appear more than 5 times?
df input
Col1 Col2 Col3 Col4
1 apple tomato banana
1 apple potato banana
1 apple tomato banana
1 apple tomato banana
1 apple tomato banana
1 apple tomato banana
1 grape tomato banana
1 pear tomato banana
1 lemon tomato banana
output
Col1 Col2 Col3 Col4
1 apple tomato banana
1 apple tomato banana
1 apple tomato banana
1 apple tomato banana
1 apple tomato banana
Global Counts
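Use stack + value_counts + replace to count each value across both columns combined -
# pool Col2 and Col3, count every value, then swap each value for its count
v = df[['Col2', 'Col3']]
df[v.replace(v.stack().value_counts()).gt(5).all(axis=1)]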
   Col1   Col2    Col3    Col4
0     1  apple  tomato  banana
2     1  apple  tomato  banana
3     1  apple  tomato  banana
4     1  apple  tomato  banana
5     1  apple  tomato  banana
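For anyone who wants to reproduce this, here is a minimal self-contained sketch that rebuilds the question's frame and applies the global count. It is a variation rather than the exact code above: it uses Series.map per column instead of replace, and the names global_counts, count_frame and result_global are just illustrative.

```python
import pandas as pd

# Rebuild the frame from the question
df = pd.DataFrame({
    "Col1": [1] * 9,
    "Col2": ["apple"] * 6 + ["grape", "pear", "lemon"],
    "Col3": ["tomato", "potato"] + ["tomato"] * 7,
    "Col4": ["banana"] * 9,
})

v = df[["Col2", "Col3"]]

# Pool both columns into one Series and count every value once
global_counts = v.stack().value_counts()

# Look up each cell's pooled count (map is used here instead of replace)
count_frame = v.apply(lambda col: col.map(global_counts))

# Keep rows whose Col2 AND Col3 values both occur more than 5 times
result_global = df[count_frame.gt(5).all(axis=1)]
print(result_global)
```

Because the counts are pooled, a value that shows up in both Col2 and Col3 would have its occurrences added together.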
(Update)
Columnwise Counts
Call apply with pd.Series.value_counts on your columns of interest, and filter in the same manner as before -
# count values separately within each column, then swap each value for its count
v = df[['Col2', 'Col3']]
df[v.replace(v.apply(pd.Series.value_counts)).gt(5).all(axis=1)]
   Col1   Col2    Col3    Col4
0     1  apple  tomato  banana
2     1  apple  tomato  banana
3     1  apple  tomato  banana
4     1  apple  tomato  banana
5     1  apple  tomato  banana
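On the question's data both approaches return the same rows, since no value appears in both columns. A small made-up frame can show where they would diverge. The sketch below is only an illustration: the toy data and all variable names are hypothetical, and it again uses map rather than replace.

```python
import pandas as pd

# Hypothetical toy data: "x" occurs 3 times in each column, 6 times overall
toy = pd.DataFrame({"A": ["x", "x", "x", "y"],
                    "B": ["x", "x", "x", "z"]})

# Global: counts are pooled across both columns, so "x" counts as 6
pooled = toy.stack().value_counts()
global_mask = toy.apply(lambda col: col.map(pooled)).gt(5).all(axis=1)

# Columnwise: each column is counted on its own, so "x" counts as 3
columnwise_mask = (
    toy.apply(lambda col: col.map(col.value_counts())).gt(5).all(axis=1)
)

kept_global = toy[global_mask]          # rows 0, 1, 2 survive
kept_columnwise = toy[columnwise_mask]  # no rows survive
```

Which one is right depends on whether "apple" in Col2 and "apple" in Col3 should be treated as the same thing.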
Details
Use value_counts to count values in your dataframe -
c = v.apply(pd.Series.value_counts)
c
        Col2  Col3
apple    6.0   NaN
grape    1.0   NaN
lemon    1.0   NaN
pear     1.0   NaN
potato   NaN   1.0
tomato   NaN   8.0
Call replace, to replace values in the DataFrame with their counts -
i = v.replace(c)
i
   Col2  Col3
0     6     8
1     6     1
2     6     8
3     6     8
4     6     8
5     6     8
6     1     8
7     1     8
8     1     8
From that point, check which rows have a count greater than 5 in every column -
m = i.gt(5).all(axis=1)
m
0 True
1 False
2 True
3 True
4 True
5 True
6 False
7 False
8 False
dtype: bool
Use the mask to index df, i.e. df[m].
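Putting the steps together, a rough end-to-end sketch on the question's data might look like the following. It assumes the columnwise interpretation and, as a substitution, builds the count frame with Series.map instead of replace; kept and dropped are names introduced here for illustration.

```python
import pandas as pd

# Rebuild the frame from the question
df = pd.DataFrame({
    "Col1": [1] * 9,
    "Col2": ["apple"] * 6 + ["grape", "pear", "lemon"],
    "Col3": ["tomato", "potato"] + ["tomato"] * 7,
    "Col4": ["banana"] * 9,
})

v = df[["Col2", "Col3"]]

# Replace each value with how often it occurs within its own column
i = v.apply(lambda col: col.map(col.value_counts()))

# True where every selected column's count is greater than 5
m = i.gt(5).all(axis=1)

kept = df[m]      # rows 0, 2, 3, 4, 5
dropped = df[~m]  # rows 1, 6, 7, 8
print(kept)
```

Inverting the mask with ~m is a quick way to inspect which rows were filtered out.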