
Pandas filter columns of a DataFrame with bool

For a DataFrame (df) with multiple columns and rows

     A   B  C  D
0    1   4  2  6
1    2   5  7  4
2    3   6  5  6

and another DataFrame (dfBool) containing dtype: bool

0  True
1  False
2  False
3  True

What is the easiest way to split df by columns into two different DataFrames, using the transposed dfBool as a column mask, so that you get the desired output?

     A   D
0    1   6
1    2   4
2    3   6 

     B  C 
0    4  2  
1    5  7  
2    6  5  

With my limited experience, I cannot understand why dfTrue = df[dfBool.transpose() == True] does not work.

mohitos asked May 23 '16 12:05



1 Answer

I would like to modify EdChum's comment: if dfBool is a DataFrame, you first have to select its column:

import pandas as pd

df = pd.DataFrame({'D': {0: 6, 1: 4, 2: 6},
                    'A': {0: 1, 1: 2, 2: 3},
                    'C': {0: 2, 1: 7, 2: 5},
                    'B': {0: 4, 1: 5, 2: 6}})
print (df)
   A  B  C  D
0  1  4  2  6
1  2  5  7  4
2  3  6  5  6

dfBool = pd.DataFrame({'a':[True, False, False, True]})
print (dfBool)
       a
0   True
1  False
2  False
3   True
#select first column in dfBool
df2 = (dfBool.iloc[:,0])
#or select column a in dfBool
#df2 = (dfBool.a)
print (df2)
0     True
1    False
2    False
3     True
Name: a, dtype: bool

print (df[df.columns[df2]])
   A  D
0  1  6
1  2  4
2  3  6

print (df[df.columns[~df2]])
   B  C
0  4  2
1  5  7
2  6  5
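
On why df[dfBool.transpose() == True] does not select columns: a boolean DataFrame placed inside [] is treated as an element-wise mask, not as a column selector. The following is a minimal sketch of that difference; the mask df > 4 and the names masked, selected and flags are made up for illustration and are not part of the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6],
                   'C': [2, 7, 5], 'D': [6, 4, 6]})

# A boolean DataFrame inside [] masks cell by cell:
# the shape is kept and cells where the mask is False become NaN.
masked = df[df > 4]
print(masked)

# A boolean array applied to df.columns yields labels,
# and labels inside [] select whole columns.
flags = np.array([True, False, False, True])
selected = df[df.columns[flags]]
print(selected)
```

So the mask has to be turned into column labels (or passed to loc on the column axis) before it can pick columns.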

Another very nice solution from ayhan, thank you:

print (df.loc[:, dfBool.a.values])
   A  D
0  1  6
1  2  4
2  3  6

print (df.loc[:, ~dfBool.a.values])
   B  C
0  4  2
1  5  7
2  6  5
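
If both halves are needed at once, the loc approach can be wrapped in a small helper. This is only a sketch: split_columns is a hypothetical name, not a pandas function, and it assumes the mask holds exactly one flag per column.

```python
import numpy as np
import pandas as pd

def split_columns(frame, mask):
    """Return (selected, rest) column subsets of frame for a boolean mask."""
    # Accept a list, Series, array or one-column DataFrame and flatten it.
    flags = np.asarray(mask, dtype=bool).ravel()
    if flags.size != frame.shape[1]:
        raise ValueError("mask length must equal the number of columns")
    return frame.loc[:, flags], frame.loc[:, ~flags]

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6],
                   'C': [2, 7, 5], 'D': [6, 4, 6]})
dfBool = pd.DataFrame({'a': [True, False, False, True]})

dfTrue, dfFalse = split_columns(df, dfBool['a'])
print(dfTrue)
print(dfFalse)
```

The length check is a design choice: it fails early instead of relying on how the installed pandas or NumPy version treats a mask of the wrong size.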

But if dfBool is a Series, the solution works without selecting a column first:

dfBool = pd.Series([True, False, False, True])
print (dfBool)

0     True
1    False
2    False
3     True
dtype: bool

print (df[df.columns[dfBool]])
   A  D
0  1  6
1  2  4
2  3  6

print (df[df.columns[~dfBool]])
   B  C
0  4  2
1  5  7
2  6  5

And the loc solution for a Series:

print (df.loc[:, dfBool.values])
   A  D
0  1  6
1  2  4
2  3  6

print (df.loc[:, ~dfBool.values])
   B  C
0  4  2
1  5  7
2  6  5
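
The .values call matters here because loc aligns a boolean Series on its index labels, and the default index 0..3 does not match the column labels A..D. A sketch of an alternative, assuming you are free to build the mask yourself: give the Series the column labels as its index, and loc accepts it directly. The name mask is illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6],
                   'C': [2, 7, 5], 'D': [6, 4, 6]})

# Index the mask by column label so loc can align it with df.columns.
mask = pd.Series([True, False, False, True], index=df.columns)

kept = df.loc[:, mask]
rest = df.loc[:, ~mask]
print(kept)
print(rest)
```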

Timings:

In [277]: %timeit (df[df.columns[dfBool.a]])
1000 loops, best of 3: 769 µs per loop

In [278]: %timeit (df.loc[:, dfBool1.a.values])
The slowest run took 9.08 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 380 µs per loop

In [279]: %timeit (df.transpose()[dfBool1.a.values].transpose())
The slowest run took 5.04 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 550 µs per loop

Code for timings:

import pandas as pd

df = pd.DataFrame({'D': {0: 6, 1: 4, 2: 6},
                    'A': {0: 1, 1: 2, 2: 3},
                    'C': {0: 2, 1: 7, 2: 5},
                    'B': {0: 4, 1: 5, 2: 6}})
print (df)
df = pd.concat([df]*1000, axis=1).reset_index(drop=True)

dfBool = pd.DataFrame({'a': [True, False, False, True]})
dfBool1 = pd.concat([dfBool]*1000).reset_index(drop=True)

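The output is a little different, because selecting by label returns every column named A and then every column named D, while the boolean mask keeps the original column order: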

print (df[df.columns[dfBool.a]])
   A  A  A  A  A  A  A  A  A  A ...  D  D  D  D  D  D  D  D  D  D
0  1  1  1  1  1  1  1  1  1  1 ...  6  6  6  6  6  6  6  6  6  6
1  2  2  2  2  2  2  2  2  2  2 ...  4  4  4  4  4  4  4  4  4  4
2  3  3  3  3  3  3  3  3  3  3 ...  6  6  6  6  6  6  6  6  6  6

[3 rows x 2000 columns]

print (df.loc[:, dfBool1.a.values])
   A  D  A  D  A  D  A  D  A  D ...  A  D  A  D  A  D  A  D  A  D
0  1  6  1  6  1  6  1  6  1  6 ...  1  6  1  6  1  6  1  6  1  6
1  2  4  2  4  2  4  2  4  2  4 ...  2  4  2  4  2  4  2  4  2  4
2  3  6  3  6  3  6  3  6  3  6 ...  3  6  3  6  3  6  3  6  3  6

[3 rows x 2000 columns]

print (df.transpose()[dfBool1.a.values].transpose())
   A  D  A  D  A  D  A  D  A  D ...  A  D  A  D  A  D  A  D  A  D
0  1  6  1  6  1  6  1  6  1  6 ...  1  6  1  6  1  6  1  6  1  6
1  2  4  2  4  2  4  2  4  2  4 ...  2  4  2  4  2  4  2  4  2  4
2  3  6  3  6  3  6  3  6  3  6 ...  3  6  3  6  3  6  3  6  3  6

[3 rows x 2000 columns]
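
The comparison can also be rerun outside IPython with the standard timeit module. This is a rough sketch, not a reproduction of the numbers above: the names wide, mask, by_loc and by_transpose are made up here, the mask is built with np.tile instead of concatenating dfBool, and absolute times depend on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

base = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6],
                     'C': [2, 7, 5], 'D': [6, 4, 6]})
wide = pd.concat([base] * 1000, axis=1)            # 4000 columns, labels repeat
mask = np.tile([True, False, False, True], 1000)   # one flag per column

def by_loc():
    # Boolean array on the column axis.
    return wide.loc[:, mask]

def by_transpose():
    # Transpose, filter rows with the same array, transpose back.
    return wide.T[mask].T

for fn in (by_loc, by_transpose):
    seconds = timeit.timeit(fn, number=100) / 100
    print(f"{fn.__name__}: {seconds * 1e6:.0f} us per call")
```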
jezrael answered Sep 30 '22 08:09