
Why isn't pandas logical operator aligning on the index like it should?

Tags: python, pandas

Consider this simple setup:

    x = pd.Series([1, 2, 3], index=list('abc'))
    y = pd.Series([2, 3, 3], index=list('bca'))

    x

    a    1
    b    2
    c    3
    dtype: int64

    y

    b    2
    c    3
    a    3
    dtype: int64

As you can see, the indexes are the same, just in a different order.

Now, consider a simple logical comparison using the equality (==) operator:

    x == y
    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)

This throws a ValueError, most likely because the indexes are not in the same order. On the other hand, calling the equivalent eq method works:

    x.eq(y)

    a    False
    b     True
    c     True
    dtype: bool

The operator also works if y is first reordered to match x:

    x == y.reindex_like(x)

    a    False
    b     True
    c     True
    dtype: bool

My understanding was that the function and operator comparison should do the same thing, all other things equal. What is eq doing that the operator comparison doesn't?

cs95 asked Jun 01 '19 00:06



1 Answer

Viewing the whole traceback for a Series comparison with mismatched indexes, particularly focusing on the exception message:

    In [1]: import pandas as pd

    In [2]: x = pd.Series([1, 2, 3], index=list('abc'))

    In [3]: y = pd.Series([2, 3, 3], index=list('bca'))

    In [4]: x == y
    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-4-73b2790c1e5e> in <module>()
    ----> 1 x == y

    /usr/lib/python3.7/site-packages/pandas/core/ops.py in wrapper(self, other, axis)
       1188
       1189         elif isinstance(other, ABCSeries) and not self._indexed_same(other):
    -> 1190             raise ValueError("Can only compare identically-labeled "
       1191                              "Series objects")
       1192

    ValueError: Can only compare identically-labeled Series objects

we see that this is a deliberate implementation decision. Also, this is not unique to Series objects - DataFrames raise a similar error.
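As a quick check of the claim above, here is a minimal sketch, assuming a pandas version at or after 0.19.0. The helper `raises_value_error` and the column name `"v"` are made up for illustration; the rest reuses `x` and `y` from the question. It shows that `==` rejects mismatched labels for both Series and DataFrame, while the `eq` method aligns on labels first.

```python
import pandas as pd

x = pd.Series([1, 2, 3], index=list("abc"))
y = pd.Series([2, 3, 3], index=list("bca"))


def raises_value_error(compare):
    """Hypothetical helper: True if calling compare() raises ValueError."""
    try:
        compare()
    except ValueError:
        return True
    return False


# Series: same labels in a different order are rejected by ==.
series_raises = raises_value_error(lambda: x == y)

# DataFrame: wrap the same data in one-column frames ("v" is an arbitrary name).
df_x = x.to_frame("v")
df_y = y.to_frame("v")
frame_raises = raises_value_error(lambda: df_x == df_y)

# The flexible method aligns on labels before comparing.
aligned = x.eq(y)

print(series_raises, frame_raises)
print(aligned)
```

The exact wording of the DataFrame error message differs between pandas versions, so the sketch only checks the exception type.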

Digging through the Git blame for the relevant lines eventually turns up some relevant commits and issue tracker threads. For example, Series.__eq__ used to completely ignore the RHS's index, and in a comment on a bug report about that behavior, Pandas author Wes McKinney says the following:

This is actually a feature / deliberate choice and not a bug-- it's related to #652. Back in January I changed the comparison methods to do auto-alignment, but found that it led to a large amount of bugs / breakage for users and, in particular, many NumPy functions (which regularly do things like arr[1:] == arr[:-1]; example: np.unique) stopped working.

This gets back to the issue that Series isn't quite ndarray-like enough and should probably not be a subclass of ndarray.

So, I haven't got a good answer for you except for that; auto-alignment would be ideal but I don't think I can do it unless I make Series not a subclass of ndarray. I think this is probably a good idea but not likely to happen until 0.9 or 0.10 (several months down the road).

This was then changed to the current behavior in pandas 0.19.0. Quoting the "what's new" page:

Following Series operators have been changed to make all operators consistent, including DataFrame (GH1134, GH4581, GH13538)

  • Series comparison operators now raise ValueError when index are different.
  • Series logical operators align both index of left and right hand side.

This made the Series behavior match that of DataFrame, which already rejected mismatched indices in comparisons.
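The two changelog bullets can be seen side by side in a short sketch. It assumes pandas 0.19.0 or later, and the boolean Series `left` and `right` are made-up data with the same labels in a different order, mirroring the question's setup.

```python
import pandas as pd

# Same labels, different order, as in the question.
left = pd.Series([True, False, True], index=list("abc"))
right = pd.Series([False, True, True], index=list("bca"))

# Logical operator: both sides are aligned on their labels first.
both = left & right

# Comparison operator: mismatched labels are rejected instead.
try:
    left == right
    comparison_raised = False
except ValueError:
    comparison_raised = True

print(both.sort_index())
print(comparison_raised)
```

Here `both` is computed label by label (for example, `left["a"] & right["a"]`), not position by position.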

In summary, making the comparison operators align indices automatically turned out to break too much stuff, so this was the best alternative.
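To illustrate the kind of breakage described in the quoted comment, here is a rough sketch of the `arr[1:] == arr[:-1]` idiom. The Series `s` is made-up data, and `eq` stands in for what an auto-aligning `==` would do; this is an illustration, not a reconstruction of the old pandas code.

```python
import pandas as pd

s = pd.Series([1, 1, 2, 3, 3])  # default integer labels 0..4

# Label-aligned comparison: each label is compared with itself, and labels
# present on only one side (0 and 4) compare as False.
aligned = s.iloc[1:].eq(s.iloc[:-1])

# Positional comparison, which is what the NumPy idiom intends:
# is each element equal to the one before it?
values = s.to_numpy()
positional = values[1:] == values[:-1]

print(aligned.tolist())     # [False, True, True, True, False]
print(positional.tolist())  # [True, False, False, True]
```

With alignment, the comparison no longer answers "does each element equal its neighbour?", which is why code written for plain arrays stopped working.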

user2357112 supports Monica answered Sep 29 '22 04:09