Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Logical OR on a subset of columns in a DataFrame

Tags:

python

pandas

I want to get all rows where (at least) one of the columns in df[mylist] contains True.

I'm currently doing:

df = df[ df[mylist[0]] | df[mylist[1]] | df[mylist[2]] ]

where mylist is a list of strings relating to columns of df. But I would like to do this for any length of mylist.

The only way I can think of to do this is to loop over mylist and create an new dataframe for each element of it and merge/concat or whatever them afterwards. But that doesn't look very smart for me.

Is there a better way?

like image 728
Peter Klauke Avatar asked Jul 24 '15 08:07

Peter Klauke


2 Answers

Building on LondonRob's answer, you could use

df.loc[df[mylist].any(axis=1)]

Calling the DataFrame's any method will perform better than using apply to call Python's builtin any function once per row.

Or you could use np.logical_or.reduce:

df.loc[np.logical_or.reduce(df[mylist], axis=1)]

For large DataFrames, using np.logical_or may be quicker:

In [30]: df = pd.DataFrame(np.random.binomial(1, 0.1, size=(100,300)).astype(bool))

In [31]: %timeit df.loc[np.logical_or.reduce(df, axis=1)]
1000 loops, best of 3: 261 µs per loop

In [32]: %timeit df.loc[df.any(axis=1)]
1000 loops, best of 3: 636 µs per loop

In [33]: %timeit df[df.apply(any, axis=1)]
100 loops, best of 3: 2.13 ms per loop

Note that df.any has extra features, such as the ability to skip NaNs. In this case, if the columns are boolean-valued, then there can not be any NaNs (since NaNs are float values). So np.logical_or.reduce is quicker.


import numpy as np
import pandas as pd
np.random.seed(2014)
df = pd.DataFrame(np.random.binomial(1, 0.1, size=(10,3)).astype(bool), 
                  columns=list('ABC'))
print(df)
#        A      B      C
# 0  False  False  False
# 1   True  False  False
# 2  False  False  False
# 3   True  False  False
# 4  False  False  False
# 5  False  False  False
# 6  False   True  False
# 7  False  False  False
# 8  False  False  False
# 9  False  False  False

mylist = list('ABC')
print(df[ df[mylist[0]] | df[mylist[1]] | df[mylist[2]] ])
print(df.loc[df[mylist].any(axis=1)])
print(df.loc[np.logical_or.reduce(df[mylist], axis=1)])

yields the rows where at least one of the columns is True:

       A      B      C
1   True  False  False
3   True  False  False
6  False   True  False
like image 56
unutbu Avatar answered Oct 02 '22 14:10

unutbu


There's a much simpler way to do this using python's built in any function:

In []: mylist
Out[]: ['A', 'B']

In []: df
Out[]: 
       A      B      C
0  False  False  False
1   True  False  False
2  False  False  False
3   True  False  False
4  False  False  False
5  False  False  False
6  False   True  False
7  False  False  False
8  False  False  False
9  False  False  False

You can apply the function any along the rows of df by using axis=1. In this case I'll only apply any to a subset of the columns:

In []: df[mylist].apply(any, axis=1)
Out[]: 
0    False
1     True
2    False
3     True
4    False
5    False
6     True
7    False
8    False
9    False
dtype: bool

This gives us the perfect way to select our rows:

In []: df[df[mylist].apply(any, axis=1)]
Out[]: 
       A      B      C
1   True  False  False
3   True  False  False
6  False   True  False
like image 24
LondonRob Avatar answered Oct 02 '22 15:10

LondonRob