
Why is pandas '==' different than '.eq()'

Tags: python, pandas

Consider the series s

import pandas as pd
s = pd.Series([(1, 2), (3, 4), (5, 6)])

This works as expected:

s == (3, 4)

0    False
1     True
2    False
dtype: bool

This does not:

s.eq((3, 4))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)

ValueError: Lengths must be equal

I was under the assumption they were the same. What is the difference between them?


What does the documentation say?

Equivalent to series == other, but with support to substitute a fill_value for missing data in one of the inputs.

This seems to imply that they should work the same, hence the confusion.
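To make the fill_value part of that sentence concrete, here is a minimal sketch. The series a and b are made-up data for illustration and are not part of the original question.

```python
import numpy as np
import pandas as pd

# Made-up example data: a has a missing value where b has 0.
a = pd.Series([1.0, np.nan, 3.0])
b = pd.Series([1.0, 0.0, 3.0])

plain = a == b                   # the NaN in a does not compare equal to 0
filled = a.eq(b, fill_value=0)   # the NaN in a is replaced by 0 before comparing

print(plain.tolist())
print(filled.tolist())
```

For two aligned Series without missing data, the two spellings should give the same result; the difference only shows up when fill_value is passed.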

asked Dec 31 '16 at 16:12 by piRSquared




2 Answers

What you're running into is a special case that exists to make it easier to compare a pandas.Series or numpy.ndarray with ordinary Python containers. The source code reads:

def flex_wrapper(self, other, level=None, fill_value=None, axis=0):
    # validate axis
    if axis is not None:
        self._get_axis_number(axis)
    if isinstance(other, ABCSeries):
        return self._binop(other, op, level=level, fill_value=fill_value)
    elif isinstance(other, (np.ndarray, list, tuple)):
        if len(other) != len(self):
            # ---------------------------------------
            # you never reach the `==` path because you get into this.
            # ---------------------------------------
            raise ValueError('Lengths must be equal')  
        return self._binop(self._constructor(other, self.index), op,
                           level=level, fill_value=fill_value)
    else:
        if fill_value is not None:
            self = self.fillna(fill_value)

        return self._constructor(op(self, other),
                                 self.index).__finalize__(self)

You're hitting the ValueError because, for .eq, pandas assumes that an array, list or tuple argument should be converted to a numpy.ndarray or pandas.Series and compared position by position, rather than treated as a single value to compare against each element. For example, if you have:

s = pd.Series([1,2,3])
s.eq([1,2,3])

you wouldn't want it to compare each element to [1,2,3].
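A small sketch of that list-like behaviour, using made-up values: a list of matching length is lined up with the Series position by position, and a list of a different length is rejected.

```python
import pandas as pd

s = pd.Series([1, 2, 3])

# A list of the same length is compared position by position.
by_position = s.eq([1, 0, 3])
print(by_position.tolist())

# A list of a different length cannot be lined up with s, so .eq raises.
try:
    s.eq([1, 2])
    length_error = None
except ValueError as exc:
    length_error = str(exc)
print(length_error)
```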

The problem is that object arrays (dtype=object) often slip through the cracks or are neglected on purpose. A simple if self.dtype != 'object' branch inside that method could resolve this issue. But maybe the developers had strong reasons to make this case different. I would advise asking for clarification on their bug tracker.
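Until something like that dtype check exists in pandas itself, the same idea can be approximated in user code. The helper below is hypothetical (eq_elementwise is not a pandas function) and only a sketch: it assumes that a tuple passed to an object-dtype Series is meant as a single value.

```python
import pandas as pd

def eq_elementwise(series, value):
    """Hypothetical helper, not part of pandas.

    Treats a tuple as one value when the Series holds Python objects,
    and falls back to the normal .eq behaviour otherwise.
    """
    if series.dtype == object and isinstance(value, tuple):
        # Compare each stored object against the whole tuple.
        return series.map(lambda item: item == value)
    return series.eq(value)

tuples = pd.Series([(1, 2), (3, 4), (5, 6)])
numbers = pd.Series([1, 2, 3])

tuple_matches = eq_elementwise(tuples, (3, 4))
number_matches = eq_elementwise(numbers, 2)
print(tuple_matches.tolist())
print(number_matches.tolist())
```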


You haven't asked how to make it work correctly, but for completeness I'll include one possibility (judging from the source code, you likely need to wrap the value in a pandas.Series yourself):

>>> s.eq(pd.Series([(1, 2)]))
0     True
1    False
2    False
dtype: bool
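To get the same answer as s == (3, 4) from the question, one option is to build the wrapping Series with one copy of the tuple per row. This is a sketch under that assumption; the name target is made up for illustration.

```python
import pandas as pd

s = pd.Series([(1, 2), (3, 4), (5, 6)])

# Repeat the tuple once per row and reuse the index of s,
# so every element of s has a partner to be compared with.
target = pd.Series([(3, 4)] * len(s), index=s.index)
matches = s.eq(target)
print(matches.tolist())
```

Reusing s.index matters because .eq aligns two Series on their labels before comparing them.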
answered Oct 28 '22 at 14:10 by MSeifert


== compares each element of the Series against the given value and yields a vector of truth values, whereas .eq treats a list-like argument as a second sequence to be compared position by position, for which being of the same length is a requirement. Ayhan points to one exception: when you call .eq(scalar value) on a pandas vector type, the scalar is simply broadcast to a vector of the same size for the comparison.
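The scalar case can be checked with a short sketch (the values are made up): for a plain scalar, .eq and == should agree.

```python
import pandas as pd

s = pd.Series([1, 2, 3])

via_method = s.eq(2)     # the scalar is compared against every element
via_operator = s == 2

print(via_method.tolist())
print(via_method.equals(via_operator))
```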

answered Oct 28 '22 at 13:10 by Marcus Müller