Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to implement a Boolean search with multiple columns in pandas

Tags:

python

pandas

I have a pandas df and would like to accomplish something along these lines (in SQL terms):

SELECT * FROM df WHERE column1 = 'a' OR column2 = 'b' OR column3 = 'c' etc. 

Now this works, for one column/value pair:

foo = df.loc[df['column']==value] 

However, I'm not sure how to expand that to multiple column/value pairs.

  • To be clear, each column matches a different value.
like image 685
Tyler Wood Avatar asked Mar 20 '14 22:03

Tyler Wood


People also ask

Can you sort by multiple columns pandas?

You can sort pandas DataFrame by one or multiple (one or more) columns using sort_values() method and by ascending or descending order. To specify the order, you have to use ascending boolean property; False for descending and True for ascending.

Is boolean indexing possible in DataFrame?

Boolean indexing helps us to select the data from the DataFrames using a boolean vector. We need a DataFrame with a boolean index to use the boolean indexing.

What is boolean indexing pandas?

Boolean indexing is a type of indexing that uses actual values of the data in the DataFrame. In boolean indexing, we can filter a data in four ways: Accessing a DataFrame with a boolean index. Applying a boolean mask to a dataframe. Masking data based on column value.

What is stacking in pandas?

Pandas DataFrame: stack() function The stack() function is used to stack the prescribed level(s) from columns to index. Return a reshaped DataFrame or Series having a multi-level index with one or more new inner-most levels compared to the current DataFrame.


1 Answers

You need to enclose multiple conditions in braces due to operator precedence and use the bitwise and (&) and or (|) operators:

foo = df[(df['column1']==value) | (df['columns2'] == 'b') | (df['column3'] == 'c')] 

If you use and or or, then pandas is likely to moan that the comparison is ambiguous. In that case, it is unclear whether we are comparing every value in a series in the condition, and what does it mean if only 1 or all but 1 match the condition. That is why you should use the bitwise operators or the numpy np.all or np.any to specify the matching criteria.

There is also the query method: http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.query.html

but there are some limitations mainly to do with issues where there could be ambiguity between column names and index values.

like image 82
EdChum Avatar answered Sep 28 '22 14:09

EdChum