I want to get all rows where (at least) one of the columns in df[mylist] contains True.
I'm currently doing:
df = df[ df[mylist[0]] | df[mylist[1]] | df[mylist[2]] ]
where mylist
is a list of strings relating to columns of df
. But I would like to do this for any length of mylist
.
The only way I can think of to do this is to loop over mylist
and create an new dataframe for each element of it and merge/concat or whatever them afterwards. But that doesn't look very smart for me.
Is there a better way?
Building on LondonRob's answer, you could use
df.loc[df[mylist].any(axis=1)]
Calling the DataFrame's any
method will perform better than using apply
to call Python's builtin any
function once per row.
Or you could use np.logical_or.reduce
:
df.loc[np.logical_or.reduce(df[mylist], axis=1)]
For large DataFrames, using np.logical_or
may be quicker:
In [30]: df = pd.DataFrame(np.random.binomial(1, 0.1, size=(100,300)).astype(bool))
In [31]: %timeit df.loc[np.logical_or.reduce(df, axis=1)]
1000 loops, best of 3: 261 µs per loop
In [32]: %timeit df.loc[df.any(axis=1)]
1000 loops, best of 3: 636 µs per loop
In [33]: %timeit df[df.apply(any, axis=1)]
100 loops, best of 3: 2.13 ms per loop
Note that df.any
has extra features, such as the ability to skip NaNs.
In this case, if the columns are boolean-valued, then there can not be any NaNs
(since NaNs are float values). So np.logical_or.reduce
is quicker.
import numpy as np
import pandas as pd
np.random.seed(2014)
df = pd.DataFrame(np.random.binomial(1, 0.1, size=(10,3)).astype(bool),
columns=list('ABC'))
print(df)
# A B C
# 0 False False False
# 1 True False False
# 2 False False False
# 3 True False False
# 4 False False False
# 5 False False False
# 6 False True False
# 7 False False False
# 8 False False False
# 9 False False False
mylist = list('ABC')
print(df[ df[mylist[0]] | df[mylist[1]] | df[mylist[2]] ])
print(df.loc[df[mylist].any(axis=1)])
print(df.loc[np.logical_or.reduce(df[mylist], axis=1)])
yields the rows where at least one of the columns is True:
A B C
1 True False False
3 True False False
6 False True False
There's a much simpler way to do this using python's built in any
function:
In []: mylist
Out[]: ['A', 'B']
In []: df
Out[]:
A B C
0 False False False
1 True False False
2 False False False
3 True False False
4 False False False
5 False False False
6 False True False
7 False False False
8 False False False
9 False False False
You can apply
the function any
along the rows of df
by using axis=1
. In this case I'll only apply any
to a subset of the columns:
In []: df[mylist].apply(any, axis=1)
Out[]:
0 False
1 True
2 False
3 True
4 False
5 False
6 True
7 False
8 False
9 False
dtype: bool
This gives us the perfect way to select our rows:
In []: df[df[mylist].apply(any, axis=1)]
Out[]:
A B C
1 True False False
3 True False False
6 False True False
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With