
Broken symmetry of operations between Boolean pandas.Series with unequal index

The implicit index-matching that pandas does for operations between different DataFrame/Series objects is great, and most of the time it just works.

However, I've stumbled on an example that does not work as expected:

import pandas as pd # 0.21.0
import numpy as np # 1.13.3
x = pd.Series([True, False, True, True], index = range(4))
y = pd.Series([False, True, True, False], index = [2,4,3,5])

# logical AND: this works, symmetric as it should be
pd.concat([x, y, x & y, y & x], keys = ['x', 'y', 'x&y', 'y&x'], axis = 1)
#        x      y    x&y    y&x
# 0   True    NaN  False  False
# 1  False    NaN  False  False
# 2   True  False  False  False
# 3   True   True   True   True
# 4    NaN   True  False  False
# 5    NaN  False  False  False

# but logical OR is not symmetric anymore (same for XOR: x^y vs. y^x)
pd.concat([x, y, x | y, y | x], keys = ['x', 'y', 'x|y', 'y|x'], axis = 1)
#        x      y    x|y    y|x
# 0   True    NaN   True  False <-- INCONSISTENT!
# 1  False    NaN  False  False
# 2   True  False   True   True
# 3   True   True   True   True
# 4    NaN   True  False   True <-- INCONSISTENT!
# 5    NaN  False  False  False

Researching a bit, I found two points that seem relevant:

  • bool(np.nan) equals True, cf. https://stackoverflow.com/a/15686477/2965879
  • | is resolved to np.bitwise_or, rather than np.logical_or, cf. https://stackoverflow.com/a/37132854/2965879
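Both points can be checked directly. The snippet below is only a sanity check in plain Python and NumPy, not a look at what pandas does internally; the variable names are illustrative.

```python
import operator
import numpy as np

# Point 1: NaN is truthy, so a naive bool() conversion would turn it into True.
nan_is_truthy = bool(np.nan)

# Point 2: on boolean arrays, | gives the same result as np.bitwise_or (elementwise).
a = np.array([True, False, True])
b = np.array([False, False, True])
same_as_bitwise_or = np.array_equal(a | b, np.bitwise_or(a, b))

# The scalar | operator is not defined between a float NaN and a bool,
# so it raises TypeError instead of returning a value.
try:
    operator.or_(np.nan, True)
    nan_or_raises = False
except TypeError:
    nan_or_raises = True

print(nan_is_truthy, same_as_bitwise_or, nan_or_raises)
```

The last check is relevant because a NaN that reaches the | operator cannot simply be "or-ed" with a boolean; something has to decide what the result for that element is.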

But ultimately, the kicker seems to be that pandas does casting from nan to False at some point. Looking at the above, it appears that this happens after calling np.bitwise_or, while I think this should happen before?

In particular, using np.logical_or does not help because it misses the index alignment that pandas does, and also, I don't want np.nan or False to equal True. (In other words, the answer https://stackoverflow.com/a/37132854/2965879 does not help.)

I think that if this wonderful syntactic sugar is provided, it should be as consistent as possible*, and so | should be symmetric. It's really hard to debug (as happened to me) when something that's always symmetric suddenly isn't anymore.

So finally, the question: Is there any feasible workaround (e.g. overloading something) to salvage x|y == y|x, and ideally in such a way that (loosely speaking) nan | True == True == True | nan and nan | False == False == False | nan?

*even if De Morgan's law falls apart regardless - ~(x&y) can not fully match ~y|~x because the NaNs only come in at the index alignment (and so are not affected by a previous negation).
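The footnote can be illustrated with a small sketch. It assumes that missing entries are treated as False after alignment, and `fill_false` is a hypothetical helper introduced only for this example.

```python
import pandas as pd

x = pd.Series([True, False, True, True], index=range(4))
y = pd.Series([False, True, True, False], index=[2, 4, 3, 5])

def fill_false(s, idx):
    # Hypothetical helper: align s to idx and treat missing entries as False.
    return s.reindex(idx, fill_value=False).astype(bool)

idx = x.index.union(y.index)

# Negation after combining: the filled-in entries take part in the negation.
lhs = ~(fill_false(x, idx) & fill_false(y, idx))
# Negation before aligning: the filled-in entries are added afterwards and stay False.
rhs = fill_false(~x, idx) | fill_false(~y, idx)

# Index labels where the two sides of De Morgan's law disagree.
mismatch = lhs[lhs != rhs].index.tolist()
print(mismatch)
```

The disagreement shows up only at labels that exist in one Series but not the other, which is where alignment has to invent a value.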

Axel asked Dec 05 '17 17:12


1 Answer

After doing some exploring in pandas, I discovered that there is a function called pandas.core.ops._bool_method_SERIES which is one of several factory functions that wrap the boolean operators for Series objects.

>>> f = pandas.Series.__or__
>>> f #the actual function you call when you do x|y
<function _bool_method_SERIES.<locals>.wrapper at 0x107436bf8>
>>> f.__closure__[0].cell_contents
    #it holds a reference to the other function defined in this factory na_op
<function _bool_method_SERIES.<locals>.na_op at 0x107436b70>
>>> f.__closure__[0].cell_contents.__closure__[0].cell_contents
    #and na_op has a reference to the built-in function or_
<built-in function or_>

This means we could theoretically define our own method that performs a logical or with the correct logic. First, let's see what arguments it would actually receive (remember that an operator function is expected to raise a TypeError if the operation can't be performed):

def test_logical_or(a,b):
    print("**** calling logical_or with ****")
    print(type(a), a)
    print(type(b), b)
    print("******")
    raise TypeError("test_logical_or isn't implemented")

#make the wrapper method
wrapper = pd.core.ops._bool_method_SERIES(test_logical_or, None,None)
pd.Series.logical_or = wrapper #insert method


x = pd.Series([True, False, True, True], index = range(4))
y = pd.Series([False, True, True, False], index = [2,4,3,5])

z = x.logical_or(y) #let's try it out!

print(x,y,z, sep="\n")

When this is run (at least with pandas version 0.19.1), it prints:

**** calling logical_or with ****
<class 'numpy.ndarray'> [True False True True nan nan]
<class 'numpy.ndarray'> [False False False  True  True False]
******
**** calling logical_or with ****
<class 'bool'> True
<class 'bool'> False
******
Traceback (most recent call last):
   ...

So it looks like pandas first called our method with two numpy arrays. In the second array the nan values have already been replaced with False, but not in the first, which is likely why the symmetry breaks. When that call failed, it tried again, I'd assume element-wise.
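To see how that observation could produce the table from the question, here is a toy model in plain Python. It is not pandas' actual implementation; it only assumes what the printout suggests: the right operand is pre-filled with False, the left operand keeps its nan, and any result that involved a nan is filled with False afterwards. The name `model_or` is made up for this sketch.

```python
import math

def is_nan(v):
    # True only for float NaN; booleans are never NaN.
    return isinstance(v, float) and math.isnan(v)

def model_or(left, right):
    # Assumption: the right operand is filled with False before the operation.
    right = [False if is_nan(v) else v for v in right]
    out = []
    for a, b in zip(left, right):
        if is_nan(a):
            # Assumption: a NaN on the left yields NaN, which is filled with False afterwards.
            out.append(False)
        else:
            out.append(a | b)
    return out

nan = float("nan")
x_aligned = [True, False, True, True, nan, nan]   # x reindexed to labels 0..5
y_aligned = [nan, nan, False, True, True, False]  # y reindexed to labels 0..5

x_or_y = model_or(x_aligned, y_aligned)
y_or_x = model_or(y_aligned, x_aligned)
print(x_or_y)
print(y_or_x)
```

Under these assumptions the two results differ at labels 0 and 4, which are the rows marked as inconsistent in the question.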

So as a bare minimum to get this working, you can explicitly check that both arguments are numpy arrays, convert all the nan entries to False (the second array is already filled, so this mainly affects the first), then return np.logical_or(a,b). I'm going to assume that in any other case we will just raise an error.

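def my_logical_or(a,b):
    if isinstance(a, np.ndarray) and isinstance(b, np.ndarray):
        # work on copies so the data behind the original Series is never modified in place
        a, b = a.copy(), b.copy()
        # treat nan (introduced by index alignment) as False on both sides
        a[np.isnan(a.astype(float))] = False
        b[np.isnan(b.astype(float))] = False
        return np.logical_or(a,b)
    else:
        raise TypeError("custom logical or is only implemented for numpy arrays")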

wrapper = pd.core.ops._bool_method_SERIES(my_logical_or, None,None)
pd.Series.logical_or = wrapper


x = pd.Series([True, False, True, True], index = range(4))
y = pd.Series([False, True, True, False], index = [2,4,3,5])

z = pd.concat([x, y, x.logical_or(y), y.logical_or(x)], keys = ['x', 'y', 'x|y', 'y|x'], axis = 1)
print(z)
#        x      y    x|y    y|x
# 0   True    NaN   True   True <-- same!
# 1  False    NaN  False  False
# 2   True  False   True   True
# 3   True   True   True   True
# 4    NaN   True   True   True <-- same!
# 5    NaN  False  False  False

So that could be your workaround. I would not recommend modifying Series.__or__ itself, since we don't know who else is using it and we don't want to break any code that expects the default behaviour.
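Since pd.core.ops._bool_method_SERIES is a private helper and may not be available in other pandas versions, the same idea can also be sketched with public API only: align both Series on the union of their indexes, fill the gaps with False, and only then apply |. The function name `symmetric_or` is made up for this example, and treating missing entries as False is the assumption taken from the question.

```python
import pandas as pd

def symmetric_or(a, b):
    # Align both operands on the union of their indexes, filling gaps with False,
    # so that neither side carries NaN into the | operation.
    idx = a.index.union(b.index)
    a_filled = a.reindex(idx, fill_value=False).astype(bool)
    b_filled = b.reindex(idx, fill_value=False).astype(bool)
    return a_filled | b_filled

x = pd.Series([True, False, True, True], index=range(4))
y = pd.Series([False, True, True, False], index=[2, 4, 3, 5])

xy = symmetric_or(x, y)
yx = symmetric_or(y, x)
print(xy.equals(yx))
```

Because both operands are filled before the operation, the order no longer matters. Replacing | with ^ inside the function should give a symmetric XOR in the same way.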


Alternatively, we can modify the source code at pandas.core.ops line 943 to fill NaN values with False (or 0) for self in the same way it does with other, so we'd change the line:

    return filler(self._constructor(na_op(self.values, other.values),
                                    index=self.index, name=name))

to use filler(self).values instead of self.values:

    return filler(self._constructor(na_op(filler(self).values, other.values),
                                    index=self.index, name=name))

This also fixes the asymmetry of or and xor. However, I would not recommend it, since it may break other code; I personally don't have nearly enough experience with pandas to determine what this would change in different circumstances.

Tadhg McDonald-Jensen answered Nov 03 '22 13:11