
pandas column selection: non commutative bitwise OR when selecting on str and NaN

Tags: python, pandas

Introduction:

Given a dataframe, I thought that the following was always true:

df[(condition_1) | (condition_2)] <=> df[(condition_2) | (condition_1)]

as in

df[(df.col1==1) | (df.col1==2)] <=> df[(df.col1==2) | (df.col1==1)]

Problem:

But it turns out that it fails in the following situation, which involves NaN (probably the reason for the failure):

df = pd.DataFrame([[np.nan, "abc", 2], ["abc", 2, 3], [np.nan, 5,6], [8,9,10]], columns=["A", "B", "C"])
df
     A    B   C
0  NaN  abc   2
1  abc    2   3
2  NaN    5   6
3    8    9  10

The following works as expected:

df[(df.A.isnull()) | (df.A.str.startswith("a"))]
     A    B  C
0  NaN  abc  2
1  abc    2  3
2  NaN    5  6

But if I swap the operands, I get a different result:

df[(df.A.str.startswith("a")) | (df.A.isnull())]
     A  B  C
1  abc  2  3

I think that the problem comes from this condition:

df.A.str.startswith("a")
0     NaN
1    True
2     NaN
3     NaN
Name: A, dtype: object

Here I get NaN instead of False.

Questions:

  • Is this behavior expected, or is it a bug? It can silently drop rows if one is not expecting it.
  • Why does it behave this way (non-commutatively)?

More details:

More precisely, let C1 = (df.A.str.startswith("a")) and C2 = (df.A.isnull()):

with:

  C1     C2
 NaN   True
True  False
 NaN   True
 NaN  False

We have:

C1 | C2
0    False
1     True
2    False
3    False
Name: A, dtype: bool

Here C2 seems to be ignored wherever C1 is NaN, and NaN becomes False.

And here:

C2 | C1
0     True
1     True
2     True
3    False
Name: A, dtype: bool

Here NaN is treated as False (with & the result is all False), but both conditions are evaluated.

Clearly: C1 | C2 != C2 | C1

I wouldn't mind NaN producing odd results as long as commutativity is preserved, but here one condition is not evaluated at all.

Actually the NaN in the input isn't the problem, because you have the same problem on column B:

(df.B.str.startswith("a")) | (df.B==2) != (df.B==2) | (df.B.str.startswith("a"))

It's because applying a str method to non-string objects returns NaN*, which, if evaluated first, prevents the second condition from being evaluated. So the main problem remains.

*(can be chosen with str.startswith("a", na=False) as @ayhan noticed)
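
A minimal sketch of that footnote, assuming the same df as above: with na=False the str method should return a plain boolean mask, so both operand orders select the same rows. The names mask_str and mask_null are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[np.nan, "abc", 2], ["abc", 2, 3], [np.nan, 5, 6], [8, 9, 10]],
    columns=["A", "B", "C"],
)

# na=False replaces the NaN results (missing or non-string values) with False
mask_str = df.A.str.startswith("a", na=False)
mask_null = df.A.isnull()

left = mask_str | mask_null
right = mask_null | mask_str

print(left.equals(right))  # do both operand orders agree?
print(df[left])
```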

jrjc asked Oct 19 '22 03:10


1 Answer

After some research, I am fairly sure that this is a bug in pandas. I was not able to find the specific cause in the pandas code, but my conclusion is that either this operation should be forbidden altogether or there is a bug in how the | expression is evaluated. You can reproduce the problem with a very simple example:

import numpy as np
import pandas as pd

a = pd.Series(np.nan)
b = pd.Series(True)

print( a | b )  # Gives False
print( b | a )  # Gives True

The second result is obviously the correct one. Since I do not know the pandas code base well, I can only guess why the first one fails. If I am mistaken, please correct me, and if you feel this is not enough of an answer, please let me know.

Generally, np.nan is truthy in Python (it is a nonzero float), as you can easily check:

import numpy as np
if np.nan:
    print("I am True")

The same holds in numpy and in pandas, as you can see by doing:

import numpy as np
import pandas as pd
if np.all(np.array([np.nan])):
    print("I am True in numpy")
if pd.Series([np.nan]).astype("bool").item():  # .item() replaces the deprecated Series.bool()
    print("and in pandas")

or by simply doing pd.Series([np.nan]).astype("bool").

So far everything is consistent. The problem arises when you apply | to a Series containing NaNs. Several other people have run into similar problems, for example in this question or that blog post (which is for an older version, though), and none of them gives a satisfactory answer. The only answer to the linked question gives no good reason, since | does not even behave the same way as it would for numpy arrays containing the same information. For numpy, np.array(np.nan) | np.array(True) and np.array(np.nan) | np.array(1.0) both raise a TypeError, because np.bitwise_or cannot work on floats.
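
The numpy side of that comparison can be checked with a short sketch like the following. The variable raised is only there for illustration.

```python
import numpy as np

# | on numpy arrays dispatches to np.bitwise_or, which has no float loop
try:
    np.array(np.nan) | np.array(True)
    raised = False
except TypeError as exc:
    raised = True
    print("TypeError:", exc)
```

So numpy refuses the operation outright, whereas pandas returns a result whose value depends on operand order.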

Due to the inconsistent behavior and the lack of any documentation on it, I can only conclude that this is a bug. As a workaround, you can fall back to the solution proposed by @ayhan and use the na parameter, provided it exists for all the functions you need. You could also call .astype("bool") on the Series/DataFrame that you want to compare. Note, however, that this converts NaN to True, following the usual Python convention (see this answer for example). If you want to avoid that, you can use .fillna(False).astype("bool"), which I found here. In any case, one should probably file a bug report with pandas, as this behavior is clearly inconsistent!
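
A minimal sketch of the .fillna(False).astype("bool") workaround, using hand-built stand-ins for the two conditions from the question (the names c1, c2 and c1_clean are assumptions for illustration). Some pandas versions may emit a FutureWarning about downcasting on the fillna call.

```python
import numpy as np
import pandas as pd

# Hand-built stand-ins for the two conditions in the question (assumed values)
c1 = pd.Series([np.nan, True, np.nan, np.nan], dtype=object)
c2 = pd.Series([True, False, True, False])

# Map NaN to False first, then cast, so that NaN does not become True
c1_clean = c1.fillna(False).astype("bool")

left = c1_clean | c2
right = c2 | c1_clean

print(left.equals(right))  # do both operand orders agree?
print(left.tolist())
```

Once both operands are genuine boolean Series, | is an ordinary element-wise OR and operand order no longer matters.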

jotasi answered Oct 21 '22 05:10