Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas if any n of m conditions are met

Tags:

python

pandas

Example.

Let's say I have dataframe with several columns and I want to select rows that match all 4 conditions, I would write:

condition = (df['A'] < 10) & (df['B'] < 10) & (df['C'] < 10) & (df['D'] < 10)
df.loc[condition]

Contrary to that if I want to select rows that match any of 4 conditions I would write:

condition = (df['A'] < 10) | (df['B'] < 10) | (df['C'] < 10) | (df['D'] < 10)
df.loc[condition]

Now, what if I want to select rows that match any two of those 4 conditions? That would be rows that match any combination of either columns (A and B), (A and C), (A and D), (B and C) or (C and D). It is obvious that I can write complex condition with all those combinations:

condition = ((df['A'] < 10) & (df['B'] < 10)) |\
            ((df['A'] < 10) & (df['C'] < 10)) |\
            ((df['A'] < 10) & (df['D'] < 10)) |\
            ((df['B'] < 10) & (df['C'] < 10)) |\
            ((df['C'] < 10) & (df['D'] < 10))
df.loc[condition]

But if there is 50 columns and I want to match any 20 columns of those 50, that would become impossible to list all possible combinations into condition. Is there a way to do that somehow better?

like image 574
Ivan Toman Avatar asked Jan 25 '23 09:01

Ivan Toman


1 Answers

Since True == 1 and False == 0 you can find rows that satisfy atleast N conditions by checking the sum. Series have most of the basic comparisons as attributes so you could make a single condition list with a variety of checks and then use getattr to make it tidy.

import pandas as pd
import numpy as np

np.random.seed(123)
df = pd.DataFrame(np.random.randint(0, 20, (5,4)), columns=list('ABCD'))

# can check `eq`, `lt`, `le`, `gt`, `ge`, `isin`
cond_list = [('A', 'lt', 10), ('B', 'ge', 10), ('D', 'eq', 4), 
             ('C', 'isin', [2, 4, 6])]
df_c = pd.concat([getattr(df[col], attr)(val).astype(int) 
                  for col,attr,val in cond_list], axis=1)
#   A  B  D  C
#0  0  0  0  1
#1  0  1  0  0
#2  1  1  0  0
#3  1  1  0  0
#4  0  1  0  1

# Each row satisfies this many conditions
df_c.sum(1)
#0    1
#1    1
#2    2
#3    2
#4    2
#dtype: int64

#Select those that satisfy at least 2.
df[df_c.sum(1).ge(2)]
#    A   B   C   D
#2   0  17  15   9
#3   0  14   0  15
#4  19  14   4   0

If you need some more complicated comparisons that aren't possible with .getattr then you can write them out yourself and concat that list of Series.

df_c = pd.concat([df['A'].lt(10), df['B'].ge(10), df['D'].eq(4), 
                  df['C'].isin([2,4,6])], 
                 axis=1).astype(int)
like image 163
ALollz Avatar answered Jan 31 '23 23:01

ALollz