I'm trying to automatically group conditions together in python. The difficulty lies in that if there are several conditions, like 100 conditions, it would be tedious to "AND" all of these by hand. How can I use a loop to achieve this?
import pandas as pd
s1 = pd.Series([1,2,3,4,5,6])
s2 = pd.Series([5,6,7,8,9,10])
s3 = pd.Series([11,12,5,7,8,2])
df = pd.DataFrame({'A': s1,'B': s2,'C': s3})
condition1 = df['A'] > 3
condition2 = df['B'] > 6
condition3 = df['C'] > 5
# AND Operation ->>> Can be achieved with a loop?
select = condition1 & condition2 & condition3
A couple of points to note:
np.ndarray.all
(or pd.DataFrame.all
) to calculate their intersection.You can use a list of Boolean series with NumPy or Pandas:
conditions = [df['A'] > 3,
df['B'] > 6,
df['C'] > 5]
# all equivalent
select = pd.concat(conditions, axis=1).all(axis=1)
select = np.logical_and.reduce(conditions)
select = np.array(conditions).all(axis=0)
print(select)
array([False, False, False, True, True, False], dtype=bool)
Equivalently, if you wish to name your Boolean filters, you can use a dictionary:
conditions = {1: df['A'] > 3,
2: df['B'] > 6,
3: df['C'] > 5}
select = np.array(list(conditions.values())).all(axis=0)
Performance benchmarking
Performance will be very data-dependent, you should also try reduce
as per @Kopytok's solution and check performance with your data.
df = pd.concat([df]*1000)
conditions = [df['A'] > 3,
df['B'] > 6,
df['C'] > 5]
conditions = conditions*100
%timeit reduce(lambda x, y: x & y, conditions) # 104 ms per loop
%timeit np.logical_and.reduce(conditions) # 104 ms per loop
%timeit np.array(conditions).all(axis=0) # 99.4 ms per loop
%timeit pd.concat(conditions, axis=1).all(axis=1) # 34.6 ms per loop
You can achive it by creating a list of conditions and using reduce
:
from functools import reduce
conditions = [
df['A'] > 3,
df['B'] > 6,
df['C'] > 5,
]
total_condition = reduce(lambda x, y: x & y, conditions)
Test case:
d = pd.DataFrame(np.random.randint(1, 5, (700000, 3)), columns=["a", "b", "c"])
conditions = [
d["a"] > 2,
d["c"] > 1,
d["b"] > 2,
]*100
Using reduce
:
from functools import reduce
%timeit reduce(lambda x, y: x & y, conditions)
> 547 ms ± 14.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using np.concat
+ df.all()
:
%timeit pd.concat(conditions, axis=1).all(1)
> 4.19 s ± 367 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With