Pandas filter columns of a DataFrame with bool

Tags: python, pandas, dataframe

For a DataFrame (df) with multiple columns and rows:

     A   B  C  D
0    1   4  2  6
1    2   5  7  4
2    3   6  5  6

and another DataFrame (dfBool) of dtype bool, with one value per column of df:

0  True
1  False
2  False
3  True

What is the easiest way to split df by columns into two different DataFrames, one holding the columns where dfBool is True and the other holding the columns where it is False, by transposing dfBool?

With my limited experience, I cannot understand why dfTrue = df[dfBool.transpose() == True] does not work.

asked May 23 '16 12:05

mohitos

1 Answer

I would like to modify EdChum's comment, because if dfBool is a DataFrame, you first have to select its column:

import pandas as pd

df = pd.DataFrame({'D': {0: 6, 1: 4, 2: 6},
                    'A': {0: 1, 1: 2, 2: 3},
                    'C': {0: 2, 1: 7, 2: 5},
                    'B': {0: 4, 1: 5, 2: 6}})
print (df)
   A  B  C  D
0  1  4  2  6
1  2  5  7  4
2  3  6  5  6

dfBool = pd.DataFrame({'a':[True, False, False, True]})
print (dfBool)
       a
0   True
1  False
2  False
3   True

#select first column in dfBool
df2 = (dfBool.iloc[:,0])
#or select column a in dfBool
#df2 = (dfBool.a)
print (df2)
0     True
1    False
2    False
3     True
Name: a, dtype: bool

print (df[df.columns[df2]])
   A  D
0  1  6
1  2  4
2  3  6

print (df[df.columns[~df2]])
   B  C
0  4  2
1  5  7
2  6  5
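
As for why the attempt from the question does not select columns, here is a small sketch (same data as above, built with lists for brevity). Indexing a DataFrame with a boolean DataFrame masks cell by cell after aligning row and column labels, and the transposed dfBool shares no labels with df, so nothing survives:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6],
                   'C': [2, 7, 5], 'D': [6, 4, 6]})
dfBool = pd.DataFrame({'a': [True, False, False, True]})

# The transposed mask has the row label 'a' and the column labels 0..3
mask = dfBool.transpose() == True
print(mask.index.tolist(), mask.columns.tolist())

# df[<boolean DataFrame>] keeps values only where an aligned cell is True.
# No label of the mask matches df (rows 0..2, columns A..D),
# so every cell is masked out and becomes NaN.
result = df[mask]
print(result.shape)
print(result.isna().all().all())
```

So the mask has to be applied to the column labels (or positions) instead, as in the solutions here.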

Another very nice solution from ayhan, thank you:

print (df.loc[:, dfBool.a.values])
   A  D
0  1  6
1  2  4
2  3  6

print (df.loc[:, ~dfBool.a.values])
   B  C
0  4  2
1  5  7
2  6  5
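
A short sketch of why .values matters here, as far as I understand it: .loc aligns a boolean Series on its index labels (0..3), which do not match the column labels (A..D), whereas a plain NumPy array is applied by position. The relabelled mask at the end is my own illustration, not part of ayhan's solution:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6],
                   'C': [2, 7, 5], 'D': [6, 4, 6]})
dfBool = pd.DataFrame({'a': [True, False, False, True]})

# Passing the Series itself: pandas tries to align index 0..3 with
# the column labels A..D and cannot, so it raises an error.
try:
    df.loc[:, dfBool.a]
    aligned_ok = True
except Exception as exc:
    aligned_ok = False
    print(type(exc).__name__)

# Alternative to .values: give the mask the column labels as its index,
# so the alignment succeeds.
mask = pd.Series(dfBool.a.values, index=df.columns)
picked = df.loc[:, mask]
print(picked.columns.tolist())
```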

But if dfBool is a Series, the same solution works directly, with no need to select a column first:

dfBool = pd.Series([True, False, False, True])
print (dfBool)

0     True
1    False
2    False
3     True
dtype: bool

print (df[df.columns[dfBool]])
   A  D
0  1  6
1  2  4
2  3  6

print (df[df.columns[~dfBool]])
   B  C
0  4  2
1  5  7
2  6  5

And the .loc variant for a Series:

print (df.loc[:, dfBool.values])
   A  D
0  1  6
1  2  4
2  3  6

print (df.loc[:, ~dfBool.values])
   B  C
0  4  2
1  5  7
2  6  5
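
If you need both halves at once, the pieces above can be wrapped in a small function. This is only a sketch; split_columns is a hypothetical name of mine, and it assumes the mask holds exactly one boolean per column of df, whether it arrives as a one-column DataFrame, a Series, or a list:

```python
import numpy as np
import pandas as pd

def split_columns(frame, mask):
    """Hypothetical helper: return (columns where mask is True,
    columns where mask is False), selected by position."""
    # Flatten a one-column DataFrame, Series, or list to a 1-D bool array
    flags = np.asarray(mask, dtype=bool).ravel()
    if flags.size != frame.shape[1]:
        raise ValueError("mask needs exactly one boolean per column")
    return frame.loc[:, flags], frame.loc[:, ~flags]

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6],
                   'C': [2, 7, 5], 'D': [6, 4, 6]})
dfBool = pd.DataFrame({'a': [True, False, False, True]})

dfTrue, dfFalse = split_columns(df, dfBool)
print(dfTrue.columns.tolist())   # ['A', 'D']
print(dfFalse.columns.tolist())  # ['B', 'C']
```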

Timings:

In [277]: %timeit (df[df.columns[dfBool.a]])
1000 loops, best of 3: 769 µs per loop

In [278]: %timeit (df.loc[:, dfBool1.a.values])
The slowest run took 9.08 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 380 µs per loop

In [279]: %timeit (df.transpose()[dfBool1.a.values].transpose())
The slowest run took 5.04 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 550 µs per loop

Code for timings:

import pandas as pd

df = pd.DataFrame({'D': {0: 6, 1: 4, 2: 6},
                    'A': {0: 1, 1: 2, 2: 3},
                    'C': {0: 2, 1: 7, 2: 5},
                    'B': {0: 4, 1: 5, 2: 6}})
print (df)
df = pd.concat([df]*1000, axis=1).reset_index(drop=True)

dfBool = pd.DataFrame({'a': [True, False, False, True]})
dfBool1 = pd.concat([dfBool]*1000).reset_index(drop=True)

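The output differs slightly. Because the column names are duplicated, selecting by label (the first version, which uses the 4-element dfBool.a) returns all A columns followed by all D columns, while the positional versions keep the original column order: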

print (df[df.columns[dfBool.a]])
   A  A  A  A  A  A  A  A  A  A ...  D  D  D  D  D  D  D  D  D  D
0  1  1  1  1  1  1  1  1  1  1 ...  6  6  6  6  6  6  6  6  6  6
1  2  2  2  2  2  2  2  2  2  2 ...  4  4  4  4  4  4  4  4  4  4
2  3  3  3  3  3  3  3  3  3  3 ...  6  6  6  6  6  6  6  6  6  6

[3 rows x 2000 columns]

print (df.loc[:, dfBool1.a.values])
   A  D  A  D  A  D  A  D  A  D ...  A  D  A  D  A  D  A  D  A  D
0  1  6  1  6  1  6  1  6  1  6 ...  1  6  1  6  1  6  1  6  1  6
1  2  4  2  4  2  4  2  4  2  4 ...  2  4  2  4  2  4  2  4  2  4
2  3  6  3  6  3  6  3  6  3  6 ...  3  6  3  6  3  6  3  6  3  6

[3 rows x 2000 columns]

print (df.transpose()[dfBool1.a.values].transpose())
   A  D  A  D  A  D  A  D  A  D ...  A  D  A  D  A  D  A  D  A  D
0  1  6  1  6  1  6  1  6  1  6 ...  1  6  1  6  1  6  1  6  1  6
1  2  4  2  4  2  4  2  4  2  4 ...  2  4  2  4  2  4  2  4  2  4
2  3  6  3  6  3  6  3  6  3  6 ...  3  6  3  6  3  6  3  6  3  6

[3 rows x 2000 columns]
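
To repeat the comparison outside IPython, something like the following could be used. It is a sketch with my own assumptions: the column labels are made unique (A0, B0, ...), so that the label-based version selects the same columns as the positional ones, and the standard timeit module stands in for %timeit. Absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

n = 1000  # how many times the 4-column pattern is repeated
base = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6],
                     'C': [2, 7, 5], 'D': [6, 4, 6]})
wide = pd.concat([base] * n, axis=1)
# Assumption: unique labels A0, B0, C0, D0, A1, ... instead of duplicates
wide.columns = [f"{name}{i}" for i in range(n) for name in base.columns]
mask = np.tile([True, False, False, True], n)  # one flag per column

candidates = {
    "df[df.columns[mask]]": lambda: wide[wide.columns[mask]],
    "df.loc[:, mask]": lambda: wide.loc[:, mask],
    "transpose twice": lambda: wide.transpose()[mask].transpose(),
}

# With unique labels, all three variants pick the same 2000 columns
results = {name: func() for name, func in candidates.items()}

runs = 20
for name, func in candidates.items():
    seconds = timeit.timeit(func, number=runs)
    print(f"{name}: {seconds / runs * 1e6:.0f} µs per call")
```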

answered Sep 30 '22 08:09

jezrael
