Given a dataframe, I thought that the following equivalence always held:
df[(condition_1) | (condition_2)] <=> df[(condition_2) | (condition_1)]
as in
df[(df.col1==1) | (df.col1==2)] <=> df[(df.col1==2) | (df.col1==1)]
But it fails in the following situation, which involves NaN (probably the reason for the failure):
df = pd.DataFrame([[np.nan, "abc", 2], ["abc", 2, 3], [np.nan, 5,6], [8,9,10]], columns=["A", "B", "C"])
df
A B C
0 NaN abc 2
1 abc 2 3
2 NaN 5 6
3 8 9 10
The following works as expected:
df[(df.A.isnull()) | (df.A.str.startswith("a"))]
A B C
0 NaN abc 2
1 abc 2 3
2 NaN 5 6
But if I commute the elements, I get a different result:
df[(df.A.str.startswith("a")) | (df.A.isnull())]
A B C
1 abc 2 3
I think the problem comes from this condition:
df.A.str.startswith("a")
0 NaN
1 True
2 NaN
3 NaN
Name: A, dtype: object
Here I get NaN instead of False.
More precisely, let C1 = (df.A.str.startswith("a")) and C2 = (df.A.isnull()), with:
C1 C2
NaN True
True False
NaN True
NaN False
We have:
C1 | C2
0 False
1 True
2 False
3 False
Name: A, dtype: bool
Here it looks as if C2 is not evaluated, and NaN becomes False.
And here:
C2 | C1
0 True
1 True
2 True
3 False
Name: A, dtype: bool
Here NaN is treated as False (with & instead of |, the result is all False), but both conditions are evaluated.
Clearly: C1 | C2 != C2 | C1
I wouldn't mind NaN producing weird results as long as commutativity is preserved, but here one condition appears not to be evaluated at all.
Actually, the NaN in the input isn't the problem, because the same thing happens on column B:
(df.B.str.startswith("a")) | (df.B==2) != (df.B==2) | (df.B.str.startswith("a"))
It's because applying a str method to non-string objects returns NaN*, which, if evaluated first, prevents the second condition from being evaluated. So the main problem remains.
*(the fill value can be chosen with str.startswith("a", na=False), as @ayhan noticed)
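To illustrate the footnote, here is a minimal sketch of the na=False variant on the same frame. It only shows that, once the string condition contains no NaN, both orders select the same rows; the names left_first and right_first are just for illustration.

```python
import numpy as np
import pandas as pd

# Same frame as above.
df = pd.DataFrame(
    [[np.nan, "abc", 2], ["abc", 2, 3], [np.nan, 5, 6], [8, 9, 10]],
    columns=["A", "B", "C"],
)

# na=False fills missing and non-string entries with False instead of NaN.
c1 = df["A"].str.startswith("a", na=False)
c2 = df["A"].isnull()

left_first = (c1 | c2).tolist()
right_first = (c2 | c1).tolist()
print(left_first)
print(right_first)
```

With no NaN left in c1, both lists come out identical, so the row selection no longer depends on the order of the operands.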
After some research, I am fairly sure that this is a bug in pandas. I was not able to find the specific cause in the code, but my conclusion is that either the comparison should be forbidden altogether or there is a bug in evaluating the | expression. You can reproduce the problem with a very simple example:
import numpy as np
import pandas as pd
a = pd.Series(np.nan)
b = pd.Series(True)
print( a | b ) # Gives False
print( b | a ) # Gives True
The second result is obviously the correct one. I can only guess why the first one fails, because I do not know the pandas code base well. If I am mistaken, please correct me, and if you feel this is not enough of an answer, please let me know.
Generally, np.nan is truthy everywhere in Python, as you can easily check:
import numpy as np
if np.nan:
    print("I am True")
This holds in numpy and even in pandas as well, as you can see by doing:
import numpy as np
import pandas as pd
if np.all(np.array([np.nan])):
    print("I am True in numpy")
# .item() extracts the single element (Series.bool() is deprecated in recent pandas)
if pd.Series(np.nan).astype("bool").item():
    print("and in pandas")
or by simply doing pd.Series([np.nan]).astype("bool").
So far everything is consistent. The problem arises when you apply | to a Series containing NaNs. Several other people have run into similar problems, for example in this question or that blog post (which is for an older version, though), and none of them gives a satisfactory answer. The only answer to the linked question gives no good reason, as | does not even behave the way it does for numpy arrays containing the same information. For numpy, np.array(np.nan) | np.array(True) and np.array(np.nan) | np.array(1.0) actually raise a TypeError, as np.bitwise_or is not able to work on floats.
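The numpy behavior described above can be checked with a short sketch. The comparison with np.logical_or is my own addition, not something the linked posts discuss; it is included only because it tests truthiness rather than bits, so it accepts floats and is symmetric.

```python
import numpy as np

nan = np.array(np.nan)
true = np.array(True)

# The bitwise operator has no implementation for float operands.
try:
    nan | true
    raised = False
except TypeError:
    raised = True
print(raised)

# np.logical_or works on truthiness; NaN is non-zero and therefore truthy,
# so the result is the same in either order.
either_order = (bool(np.logical_or(nan, true)), bool(np.logical_or(true, nan)))
print(either_order)
```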
Given the inconsistent behavior and the lack of any documentation on it, I can only conclude that this is a bug. As a workaround you can fall back to the solution proposed by @ayhan and use the na parameter, if it exists for all the functions you need. You could also use .astype("bool") on the Series/DataFrame that you want to compare. Note, however, that this will convert NaN to True, as this is the usual Python convention (see this answer for example). If you want to avoid that, you can use .fillna(False).astype("bool"), which I found here. Generally, one should probably file a bug report with pandas, as this behavior is obviously inconsistent!
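As a rough sketch of the .fillna(False).astype("bool") workaround, applied to the frame from the question: the variable names raw, forward and backward are only illustrative, and depending on the pandas version the fillna call on an object column may emit a warning.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[np.nan, "abc", 2], ["abc", 2, 3], [np.nan, 5, 6], [8, 9, 10]],
    columns=["A", "B", "C"],
)

# Without na=..., missing and non-string entries come back as NaN.
raw = df["A"].str.startswith("a")

# Replace NaN with False first, then cast to a proper boolean mask.
c1 = raw.fillna(False).astype("bool")
c2 = df["A"].isnull()

forward = df[c1 | c2].index.tolist()
backward = df[c2 | c1].index.tolist()
print(forward)
print(backward)
```

Both orders should now select the rows the question expected, because the mask is a plain boolean Series before | is applied.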