Given a dataframe, I thought that the following always held:
df[(condition_1) | (condition_2)] <=> df[(condition_2) | (condition_1)]
as in
df[(df.col1==1) | (df.col1==2)] <=> df[(df.col1==2) | (df.col1==1)]
But it turns out that it fails in the following situation, which involves NaN (probably the reason it fails):
import numpy as np
import pandas as pd
df = pd.DataFrame([[np.nan, "abc", 2], ["abc", 2, 3], [np.nan, 5, 6], [8, 9, 10]], columns=["A", "B", "C"])
df
     A    B   C
0  NaN  abc   2
1  abc    2   3
2  NaN    5   6
3    8    9  10
The following works as expected:
df[(df.A.isnull()) | (df.A.str.startswith("a"))]
     A    B  C
0  NaN  abc  2
1  abc    2  3
2  NaN    5  6
But if I swap the two operands, I get a different result:
df[(df.A.str.startswith("a")) | (df.A.isnull())]
     A  B  C
1  abc  2  3
I think that the problem comes from this condition:
df.A.str.startswith("a")
0     NaN
1    True
2     NaN
3     NaN
Name: A, dtype: object
Here I get NaN instead of False.
More precisely, let C1 = df.A.str.startswith("a") and C2 = df.A.isnull(), with:
C1     C2
NaN    True
True   False
NaN    True
NaN    False
We have:
C1 | C2
0    False
1     True
2    False
3    False
Name: A, dtype: bool
Here C2 does not appear to be evaluated, and NaN becomes False.
And here:
C2 | C1
0     True
1     True
2     True
3    False
Name: A, dtype: bool
Here NaN is also treated as False (with & the result is all False), but both conditions are evaluated.
Clearly: C1 | C2 != C2 | C1
I wouldn't mind NaN producing odd results as long as commutativity were preserved, but here one of the conditions is not evaluated at all.
Actually, the NaN in the input isn't the problem, because the same thing happens on column B:
(df.B.str.startswith("a")) | (df.B==2) != (df.B==2) | (df.B.str.startswith("a"))
This is because applying a str method to non-string objects returns NaN*, which, if evaluated first, prevents the second condition from being evaluated. So the main problem remains.
*(The returned value can be chosen with str.startswith("a", na=False), as @ayhan noticed.)
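For reference, here is a minimal sketch of the na=False option from the footnote, using the example frame from above (the names c1, c2, left and right are only for illustration). With na=False the string condition should come back as a plain True/False mask, so both operand orders are expected to select the same rows:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[np.nan, "abc", 2], ["abc", 2, 3], [np.nan, 5, 6], [8, 9, 10]],
    columns=["A", "B", "C"],
)

# na=False fills the entries that would otherwise be NaN with False.
c1 = df.A.str.startswith("a", na=False)
c2 = df.A.isnull()

left = df[c1 | c2]
right = df[c2 | c1]

print(left.index.tolist())   # row labels selected by c1 | c2
print(left.equals(right))    # whether both orders select the same rows
```

This only sidesteps the issue for str methods that accept an na argument; it does not explain the asymmetry itself.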
After some research, I am fairly sure that this is a bug in pandas. I was not able to find the specific cause in their code, but my conclusion is that either the comparison should be forbidden altogether, or there is a bug in how the | expression is evaluated. You can reproduce the problem with a very simple example:
import numpy as np
import pandas as pd
a = pd.Series(np.nan)
b = pd.Series(True)
print( a | b ) # Gives False
print( b | a ) # Gives True
The second result is obviously the correct one. Since I do not know the pandas code base well, I can only guess why the first one fails. If I am mistaken, please correct me, and if you feel this is not enough of an answer, please let me know.
Generally, np.nan is treated as True throughout Python, as you can easily check:
import numpy as np
if np.nan:
    print("I am True")
This is valid in numpy and even pandas as well, as you can see by doing:
import numpy as np
import pandas as pd
if np.all(np.array([np.nan])):
    print("I am True in numpy")
if pd.Series(np.nan).astype("bool").bool():
    print("and in pandas")
or by simply doing pd.Series([np.nan]).astype("bool").
So far everything is consistent. The problem arises when you apply | to a Series containing NaNs. Other people have run into similar problems, for example in this question or that blog post (which covers an older version, though), and none of them gives a satisfactory answer. The only answer to the linked question does not give a good reason either, since | does not even behave the way it would for numpy arrays holding the same data: for numpy, np.array(np.nan) | np.array(True) and np.array(np.nan) | np.array(1.0) both raise a TypeError, because np.bitwise_or cannot operate on floats.
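The numpy side of this can be checked with a short sketch (the flag name raised is just for illustration). It only confirms that the | operator refuses a float operand in numpy rather than silently picking a result:

```python
import numpy as np

try:
    # bitwise_or has no implementation for float inputs, so this should raise.
    np.array(np.nan) | np.array(True)
    raised = False
except TypeError as exc:
    raised = True
    print(type(exc).__name__)

print(raised)
```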
Given the inconsistent behavior and the lack of any documentation on it, I can only conclude that this is a bug. As a workaround, you can fall back to the solution proposed by @ayhan and use the na parameter, provided it exists for all the functions you need. You could also call .astype("bool") on the Series/DataFrame that you want to compare. Note, however, that this converts NaN to True, following the usual Python convention (see this answer for example). If you want to avoid that, you can use .fillna(False).astype("bool"), which I found here. Generally, one should probably file a bug report with pandas, as this behavior is obviously inconsistent!
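A minimal sketch of the two conversions, assuming an object-dtype mask that mirrors the str.startswith("a") output shown in the question (the Series is built by hand here, and the names naive and safe are only for illustration):

```python
import numpy as np
import pandas as pd

# Hand-built stand-ins for the two conditions from the question.
c1 = pd.Series([np.nan, True, np.nan, np.nan], dtype=object)
c2 = pd.Series([True, False, True, False])

# Plain cast: NaN is truthy, so every NaN becomes True.
naive = c1.astype("bool")

# Fill first, then cast: every NaN becomes False.
safe = c1.fillna(False).astype("bool")

print(naive.tolist())
print(safe.tolist())
print((safe | c2).equals(c2 | safe))  # both orders agree on a clean bool mask
```

Depending on the pandas version, the fillna call on an object column may emit a warning about downcasting; the resulting mask should be the same either way.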