The implicit index-matching of pandas for operations between different DataFrame/Series objects is great, and most of the time it just works.
However, I've stumbled on an example that does not work as expected:
import pandas as pd # 0.21.0
import numpy as np # 1.13.3
x = pd.Series([True, False, True, True], index = range(4))
y = pd.Series([False, True, True, False], index = [2,4,3,5])
# logical AND: this works, symmetric as it should be
pd.concat([x, y, x & y, y & x], keys = ['x', 'y', 'x&y', 'y&x'], axis = 1)
# x y x&y y&x
# 0 True NaN False False
# 1 False NaN False False
# 2 True False False False
# 3 True True True True
# 4 NaN True False False
# 5 NaN False False False
# but logical OR is not symmetric anymore (same for XOR: x^y vs. y^x)
pd.concat([x, y, x | y, y | x], keys = ['x', 'y', 'x|y', 'y|x'], axis = 1)
# x y x|y y|x
# 0 True NaN True False <-- INCONSISTENT!
# 1 False NaN False False
# 2 True False True True
# 3 True True True True
# 4 NaN True False True <-- INCONSISTENT!
# 5 NaN False False False
Researching a bit, I found two points that seem relevant:
- bool(np.nan) equals True, cf. https://stackoverflow.com/a/15686477/2965879
- | is resolved to np.bitwise_or, rather than np.logical_or, cf. https://stackoverflow.com/a/37132854/2965879
But ultimately, the kicker seems to be that pandas casts nan to False at some point. Looking at the above, it appears that this happens after calling np.bitwise_or, while I think it should happen before?
In particular, using np.logical_or does not help because it skips the index alignment that pandas does, and also, I don't want np.nan or False to equal True. (In other words, the answer https://stackoverflow.com/a/37132854/2965879 does not help.)
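The two points above can be checked directly in NumPy, without pandas. This is only a small sketch of the scalar behaviour; it does not show what pandas does internally during alignment, and the variable names are made up for illustration.

```python
import numpy as np

# NaN is a non-zero float, so it is truthy.
nan_is_truthy = bool(np.nan)

# logical_or works on truthiness, so "NaN or False" comes out True.
logical_result = bool(np.logical_or(np.nan, False))

# bitwise_or has no implementation for floats, so it rejects NaN outright.
try:
    np.bitwise_or(np.nan, False)
    bitwise_raises = False
except TypeError:
    bitwise_raises = True

print(nan_is_truthy, logical_result, bitwise_raises)
```

So neither ufunc gives the desired "nan | False == False" on its own: one treats NaN as True, the other refuses to handle it.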
I think that if this wonderful syntactic sugar is provided, it should be as consistent as possible*, and so | should be symmetric. It's really hard to debug (as happened to me) when something that's always symmetric suddenly isn't anymore.
So finally, the question: Is there any feasible workaround (e.g. overloading something) to salvage x|y == y|x, ideally in such a way that (loosely speaking) nan | True == True == True | nan and nan | False == False == False | nan?
*even if De Morgan's law falls apart regardless - ~(x&y) cannot fully match ~y|~x because the NaNs only come in at the index alignment (and so are not affected by a previous negation).
After doing some exploring in pandas, I discovered that there is a function called pandas.core.ops._bool_method_SERIES, which is one of several factory functions that wrap the boolean operators for Series objects.
>>> f = pandas.Series.__or__
>>> f  # the actual function you call when you do x|y
<function _bool_method_SERIES.<locals>.wrapper at 0x107436bf8>
>>> f.__closure__[0].cell_contents  # it holds a reference to the other function defined in this factory, na_op
<function _bool_method_SERIES.<locals>.na_op at 0x107436b70>
>>> f.__closure__[0].cell_contents.__closure__[0].cell_contents  # and na_op has a reference to the built-in function or_
<built-in function or_>
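The same closure walk can be reproduced on a toy factory, which may make the session above easier to follow. The names make_wrapper, na_op and wrapper below are hypothetical stand-ins that only mimic the shape of the pandas factory; this is not the pandas source.

```python
import operator

def make_wrapper(op):
    # Inner function that closes over `op`, like na_op in the session above.
    def na_op(a, b):
        return op(a, b)

    # Outer function that closes over `na_op`; this is what callers receive.
    def wrapper(a, b):
        return na_op(a, b)

    return wrapper

f = make_wrapper(operator.or_)

# First hop: the wrapper's only closure cell holds na_op.
inner = f.__closure__[0].cell_contents
# Second hop: na_op's only closure cell holds the operator itself.
innermost = inner.__closure__[0].cell_contents

print(inner.__name__, innermost)
```

Each function here closes over exactly one name, which is why index 0 is the right cell in both hops.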
Since the operator is just a function handed to this factory, we could theoretically define our own method that performs a logical or with the correct logic. First, let's see what the factory actually passes to that function (remember that an operator function is expected to raise a TypeError if the operation can't be performed):
import pandas as pd

def test_logical_or(a, b):
    # print whatever the pandas wrapper hands us, then refuse the operation
    print("**** calling logical_or with ****")
    print(type(a), a)
    print(type(b), b)
    print("******")
    raise TypeError("my_logical_or isn't implemented")

# make the wrapper method
wrapper = pd.core.ops._bool_method_SERIES(test_logical_or, None, None)
pd.Series.logical_or = wrapper  # insert method
x = pd.Series([True, False, True, True], index=range(4))
y = pd.Series([False, True, True, False], index=[2, 4, 3, 5])
z = x.logical_or(y)  # let's try it out!
print(x, y, z, sep="\n")
When this is run (at least with pandas version 0.19.1), it prints:
**** calling logical_or with ****
<class 'numpy.ndarray'> [True False True True nan nan]
<class 'numpy.ndarray'> [False False False True True False]
******
**** calling logical_or with ****
<class 'bool'> True
<class 'bool'> False
******
Traceback (most recent call last):
...
So it looks like it tried to call our method with two numpy arrays, where for whatever reason the second one already has its nan values replaced with False but the first one does not, which is likely why the symmetry breaks. When that call failed, it tried again, I'd assume element-wise.
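Why the whole-array call fails can be sketched in plain NumPy, assuming the first operand arrives as an object array that still contains nan (as the printout above suggests). The arrays a and b below are made-up stand-ins, and what pandas does after the failure is not shown here.

```python
import numpy as np

# Stand-ins for the two operands: `a` still carries a NaN, `b` is clean.
a = np.array([True, False, np.nan], dtype=object)
b = np.array([False, True, True])

# On object arrays, | is applied to the Python objects one by one,
# and float | bool is not defined, so the whole operation raises.
try:
    a | b
    array_or_failed = False
except TypeError:
    array_or_failed = True

# With NaN replaced by False first, the same operation goes through.
filled = np.array(
    [False if isinstance(v, float) and np.isnan(v) else v for v in a],
    dtype=bool,
)
result = (filled | b).tolist()

print(array_or_failed, result)
```

This is consistent with the observation that only the operand whose nan values were already replaced behaves as expected.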
So as a bare minimum to get this working, you can explicitly check that both arguments are numpy arrays, convert all the nan entries in both of them to False, then return np.logical_or(a,b). I'm going to assume that in any other case we just raise an error.
import numpy as np
import pandas as pd

def my_logical_or(a, b):
    if isinstance(a, np.ndarray) and isinstance(b, np.ndarray):
        # treat missing values (nan) on either side as False
        a[np.isnan(a.astype(float))] = False
        b[np.isnan(b.astype(float))] = False
        return np.logical_or(a, b)
    else:
        raise TypeError("custom logical or is only implemented for numpy arrays")

wrapper = pd.core.ops._bool_method_SERIES(my_logical_or, None, None)
pd.Series.logical_or = wrapper
x = pd.Series([True, False, True, True], index=range(4))
y = pd.Series([False, True, True, False], index=[2, 4, 3, 5])
z = pd.concat([x, y, x.logical_or(y), y.logical_or(x)], keys=['x', 'y', 'x|y', 'y|x'], axis=1)
print(z)
# x y x|y y|x
# 0 True NaN True True
# 1 False NaN False False <-- same!
# 2 True False True True
# 3 True True True True
# 4 NaN True True True <-- same!
# 5 NaN False False False
So that could be your workaround. I would not recommend modifying Series.__or__ itself, since we don't know who else is using it and don't want to break any code that expects the default behaviour.
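A variant that stays away from private pandas internals is to do the alignment explicitly and fill the gaps with False before applying |. This is only a sketch under the assumption that a missing label should count as False; symmetric_or is a hypothetical helper name, not a pandas function.

```python
import pandas as pd

def symmetric_or(a, b):
    """Logical OR of two boolean Series, treating missing labels as False."""
    # Build the combined index once so both operands share the same labels.
    idx = a.index.union(b.index)
    # reindex with fill_value=False keeps the bool dtype and avoids NaN entirely.
    a_full = a.reindex(idx, fill_value=False)
    b_full = b.reindex(idx, fill_value=False)
    return a_full | b_full

x = pd.Series([True, False, True, True], index=range(4))
y = pd.Series([False, True, True, False], index=[2, 4, 3, 5])

xy = symmetric_or(x, y)
yx = symmetric_or(y, x)
print(pd.concat([xy, yx], keys=['x|y', 'y|x'], axis=1))
```

Because both operands are already filled and share one index when | runs, the result does not depend on which Series is on the left.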
Alternatively, we can modify the source code at pandas.core.ops line 943 to fill NaN values with False (or 0) for self in the same way it does with other, so we'd change the line:
return filler(self._constructor(na_op(self.values, other.values),
index=self.index, name=name))
to use filler(self).values instead of self.values:
return filler(self._constructor(na_op(filler(self).values, other.values),
index=self.index, name=name))
This also fixes the issue with or and xor not being symmetric. However, I would not recommend it, since it may break other code; I personally don't have nearly enough experience with pandas to determine what this would change in different circumstances.