Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to automatically group conditions together in python?

I'm trying to automatically group conditions together in python. The difficulty lies in that if there are several conditions, like 100 conditions, it would be tedious to "AND" all of these by hand. How can I use a loop to achieve this?

import pandas as pd
s1 = pd.Series([1,2,3,4,5,6])
s2 = pd.Series([5,6,7,8,9,10])
s3 = pd.Series([11,12,5,7,8,2])
df = pd.DataFrame({'A': s1,'B': s2,'C': s3})

condition1 = df['A'] > 3
condition2 = df['B'] > 6
condition3 = df['C'] > 5
# AND Operation ->>> Can be achieved with a loop?
select = condition1 & condition2 & condition3
like image 593
zonelincosmos Avatar asked Dec 24 '22 06:12

zonelincosmos


2 Answers

A couple of points to note:

  • You can use a list or dictionary to store a variable number of variables.
  • Your Boolean series function as NumPy arrays; you can therefore combine a collection of series and use np.ndarray.all (or pd.DataFrame.all) to calculate their intersection.

You can use a list of Boolean series with NumPy or Pandas:

conditions = [df['A'] > 3,
              df['B'] > 6,
              df['C'] > 5]

# all equivalent
select = pd.concat(conditions, axis=1).all(axis=1)
select = np.logical_and.reduce(conditions)
select = np.array(conditions).all(axis=0)

print(select)

array([False, False, False,  True,  True, False], dtype=bool)

Equivalently, if you wish to name your Boolean filters, you can use a dictionary:

conditions = {1: df['A'] > 3,
              2: df['B'] > 6,
              3: df['C'] > 5}

select = np.array(list(conditions.values())).all(axis=0)

Performance benchmarking

Performance will be very data-dependent, you should also try reduce as per @Kopytok's solution and check performance with your data.

df = pd.concat([df]*1000)

conditions = [df['A'] > 3,
              df['B'] > 6,
              df['C'] > 5]

conditions = conditions*100

%timeit reduce(lambda x, y: x & y, conditions)     # 104 ms per loop
%timeit np.logical_and.reduce(conditions)          # 104 ms per loop
%timeit np.array(conditions).all(axis=0)           # 99.4 ms per loop
%timeit pd.concat(conditions, axis=1).all(axis=1)  # 34.6 ms per loop
like image 69
jpp Avatar answered Dec 25 '22 19:12

jpp


You can achive it by creating a list of conditions and using reduce:

from functools import reduce

conditions = [
    df['A'] > 3,
    df['B'] > 6,
    df['C'] > 5,
]

total_condition = reduce(lambda x, y: x & y, conditions)

Test case:

d = pd.DataFrame(np.random.randint(1, 5, (700000, 3)), columns=["a", "b", "c"])

conditions = [
    d["a"] > 2,
    d["c"] > 1,
    d["b"] > 2,
]*100

Using reduce:

from functools import reduce

%timeit reduce(lambda x, y: x & y, conditions)
> 547 ms ± 14.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Using np.concat + df.all():

%timeit pd.concat(conditions, axis=1).all(1)
> 4.19 s ± 367 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
like image 28
koPytok Avatar answered Dec 25 '22 20:12

koPytok