
Logical Or/bitwise OR in pandas Data Frame

I am trying to use a Boolean mask to get a match from 2 different dataframes.

Using the logical OR operator:

x = df[(df['A'].isin(df2['B']))
      or df['A'].isin(df2['C'])]

Output:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

However, when I use the bitwise OR operator, the results are returned successfully.

x = df[(df['A'].isin(df2['B']))
      | df['A'].isin(df2['C'])]

Output: x

Is there a difference between the two, and would bitwise OR be the best option here? Why doesn't the logical OR work?

BernardL asked Sep 08 '16 10:09



1 Answer

As far as I understand this issue (I come from a C++ background and am currently learning Python for data science), the key point is that the bitwise operators (&, |) can be overloaded by classes, just as in C++.

When you apply these bitwise operators to integers, they combine the operands bit by bit and give you the result. For instance:

1 | 2 # will result in 3

What Python will actually do is compare the bits of these numbers:

00000001 | 00000010

The result will be:

00000011 (because 0 | 0 is False, ergo 0; and 0 | 1 is True, ergo 1)

As an integer: 3

It combines each pair of bits and returns the result of these eight consecutive operations. This is the normal behaviour of these operators.

Enter pandas. Because these operators can be overloaded, pandas has made use of this. Applied to pandas objects, the bitwise operators are used like this:
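A quick way to check the bit-by-bit behaviour described above is to print the operands and the result in binary. This is only a small sketch; the variable names are chosen for illustration.

```python
a = 1
b = 2
result = a | b  # bitwise OR on plain integers

# Show each value as an 8-bit binary string
print(format(a, "08b"))       # 00000001
print(format(b, "08b"))       # 00000010
print(format(result, "08b"))  # 00000011
print(result)                 # 3
```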

(dataframe1['column'] == "expression") & (dataframe1['column'] != "another expression")

In this case, pandas first creates a Series of True and False values from each of the == and != comparisons: every value in the column is compared to the expression, giving either True or False. Be careful: you have to put parentheses around each comparison, because in Python the bitwise operators bind more tightly than the comparison operators, so without parentheses the & would be evaluated first.
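The precedence point can be seen with plain integers before bringing pandas into it. The sketch below uses a made-up DataFrame (the columns "A" and "B" and their values are assumptions for illustration only).

```python
import pandas as pd

# Without parentheses, & is applied first:
# 1 == 1 & 2 == 2 is parsed as 1 == (1 & 2) == 2, i.e. 1 == 0 == 2
unparenthesized = 1 == 1 & 2 == 2
# With parentheses, each comparison is evaluated first
parenthesized = (1 == 1) & (2 == 2)

print(unparenthesized)  # False
print(parenthesized)    # True

# The same rule applies to pandas masks: wrap every comparison
df = pd.DataFrame({"A": [1, 2, 3], "B": ["x", "y", "x"]})
mask = (df["A"] > 1) & (df["B"] == "x")
print(mask.tolist())    # [False, False, True]
```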

Then you have two Series of True and False values of the same length. What pandas THEN does is take these two Series and combine them element by element with either "and" (&) or "or" (|), and finally return one single Series that is True where both conditions hold (for &) or where at least one holds (for |).

To go even further, what I think is happening under the hood is that the & operator calls a method that pandas defines (Python's hook for & is __and__) and passes it the two previously evaluated Series to the left and right of the operator. pandas then compares the values pair by pair, returning True or False for each pair.
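To make the element-by-element combination concrete, here is a small sketch with two hand-written boolean Series (the values are invented for illustration):

```python
import pandas as pd

left = pd.Series([True, True, False, False])
right = pd.Series([True, False, True, False])

both = left & right    # True only where both are True
either = left | right  # True where at least one is True

print(both.tolist())    # [True, False, False, False]
print(either.tolist())  # [True, True, True, False]
```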
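The overloading mechanism itself can be sketched with a toy class. This is not how pandas is implemented internally; Flags is a hypothetical class that only shows how Python routes & and | to the __and__ and __or__ methods.

```python
class Flags:
    """Hypothetical container of booleans, for illustration only."""

    def __init__(self, values):
        self.values = list(values)

    def __and__(self, other):
        # Called for: self & other
        return Flags(a and b for a, b in zip(self.values, other.values))

    def __or__(self, other):
        # Called for: self | other
        return Flags(a or b for a, b in zip(self.values, other.values))


x = Flags([True, False, True])
y = Flags([True, True, False])

and_result = (x & y).values
or_result = (x | y).values
print(and_result)  # [True, False, False]
print(or_result)   # [True, True, True]
```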

This is basically the same principle they've used for all other operators as well (>, <, >=, <=, ==, !=).

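Why go to the trouble of using & when you have the nice and neat "and"? Because "and" and "or" are built into the language and cannot be overloaded. They need a single True or False from each operand, and a Series with many values cannot be reduced to one truth value unambiguously, so pandas raises the ValueError shown in the question. So yes, the bitwise | is the right choice here.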
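Putting this together for the question's case, the following sketch reproduces both behaviours. The column names follow the question, but the data in df and df2 is made up for illustration.

```python
import pandas as pd

# Invented sample data; only the column names come from the question
df = pd.DataFrame({"A": [1, 2, 3, 4]})
df2 = pd.DataFrame({"B": [1, 9], "C": [3, 9]})

in_b = df["A"].isin(df2["B"])
in_c = df["A"].isin(df2["C"])

# "or" asks the left Series for a single truth value, which pandas refuses
try:
    in_b or in_c
    raised = False
except ValueError:
    raised = True
print(raised)  # True

# "|" combines the two masks element by element
x = df[in_b | in_c]
print(x["A"].tolist())  # [1, 3]
```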

Hope that helps!

nathan_lesage answered Oct 18 '22 14:10
