Example.
Let's say I have dataframe with several columns and I want to select rows that match all 4 conditions, I would write:
condition = (df['A'] < 10) & (df['B'] < 10) & (df['C'] < 10) & (df['D'] < 10)
df.loc[condition]
Contrary to that if I want to select rows that match any of 4 conditions I would write:
condition = (df['A'] < 10) | (df['B'] < 10) | (df['C'] < 10) | (df['D'] < 10)
df.loc[condition]
Now, what if I want to select rows that match any two of those 4 conditions? That would be rows that match any combination of either columns (A and B), (A and C), (A and D), (B and C) or (C and D). It is obvious that I can write complex condition with all those combinations:
condition = ((df['A'] < 10) & (df['B'] < 10)) |\
((df['A'] < 10) & (df['C'] < 10)) |\
((df['A'] < 10) & (df['D'] < 10)) |\
((df['B'] < 10) & (df['C'] < 10)) |\
((df['C'] < 10) & (df['D'] < 10))
df.loc[condition]
But if there is 50 columns and I want to match any 20 columns of those 50, that would become impossible to list all possible combinations into condition. Is there a way to do that somehow better?
Since True == 1
and False == 0
you can find rows that satisfy atleast N conditions by checking the sum. Series have most of the basic comparisons as attributes so you could make a single condition list with a variety of checks and then use getattr
to make it tidy.
import pandas as pd
import numpy as np
np.random.seed(123)
df = pd.DataFrame(np.random.randint(0, 20, (5,4)), columns=list('ABCD'))
# can check `eq`, `lt`, `le`, `gt`, `ge`, `isin`
cond_list = [('A', 'lt', 10), ('B', 'ge', 10), ('D', 'eq', 4),
('C', 'isin', [2, 4, 6])]
df_c = pd.concat([getattr(df[col], attr)(val).astype(int)
for col,attr,val in cond_list], axis=1)
# A B D C
#0 0 0 0 1
#1 0 1 0 0
#2 1 1 0 0
#3 1 1 0 0
#4 0 1 0 1
# Each row satisfies this many conditions
df_c.sum(1)
#0 1
#1 1
#2 2
#3 2
#4 2
#dtype: int64
#Select those that satisfy at least 2.
df[df_c.sum(1).ge(2)]
# A B C D
#2 0 17 15 9
#3 0 14 0 15
#4 19 14 4 0
If you need some more complicated comparisons that aren't possible with .getattr
then you can write them out yourself and concat that list of Series.
df_c = pd.concat([df['A'].lt(10), df['B'].ge(10), df['D'].eq(4),
df['C'].isin([2,4,6])],
axis=1).astype(int)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With