How can Pandas DataFrames appear identical but fail equals()?

Tags: python, pandas

To confirm that I understand what Pandas df.groupby() and df.reset_index() do, I attempted a round-trip from a dataframe to a grouped version of the same data and back. After the round-trip, the columns and rows had to be sorted again, because groupby() affects row order and reset_index() affects column order. After two quick maneuvers to put the columns and index back in order, the dataframes look identical:

Same list of column names.
Same dtypes for every column.
Corresponding index values are strictly equal.
Corresponding data values are strictly equal.

Yet, after all of these checks succeed, df1.equals(df5) returns the astounding value False.

What difference between these dataframes is equals() uncovering that I have not yet figured out how to check for myself?

Test code:

csv_text = """\
Title,Year,Director
North by Northwest,1959,Alfred Hitchcock
Notorious,1946,Alfred Hitchcock
The Philadelphia Story,1940,George Cukor
To Catch a Thief,1955,Alfred Hitchcock
His Girl Friday,1940,Howard Hawks
"""

import io
import pandas as pd

# Read the sample data from the in-memory string defined above.
df1 = pd.read_csv(io.StringIO(csv_text))
df1.columns = df1.columns.str.lower()
print(df1)

df2 = df1.groupby(['director', df1.index]).first()
df3 = df2.reset_index('director')
df4 = df3[['title', 'year', 'director']]
df5 = df4.sort_index()
print(df5)

print()
print(repr(df1.columns))
print(repr(df5.columns))
print()
print(df1.dtypes)
print(df5.dtypes)
print()
print(df1 == df5)
print()
print(df1.index == df5.index)
print()
print(df1.equals(df5))

The output that I receive when I run the script is:

                    title  year          director
0      North by Northwest  1959  Alfred Hitchcock
1               Notorious  1946  Alfred Hitchcock
2  The Philadelphia Story  1940      George Cukor
3        To Catch a Thief  1955  Alfred Hitchcock
4         His Girl Friday  1940      Howard Hawks
                    title  year          director
0      North by Northwest  1959  Alfred Hitchcock
1               Notorious  1946  Alfred Hitchcock
2  The Philadelphia Story  1940      George Cukor
3        To Catch a Thief  1955  Alfred Hitchcock
4         His Girl Friday  1940      Howard Hawks

Index(['title', 'year', 'director'], dtype='object')
Index(['title', 'year', 'director'], dtype='object')

title       object
year         int64
director    object
dtype: object
title       object
year         int64
director    object
dtype: object

  title  year director
0  True  True     True
1  True  True     True
2  True  True     True
3  True  True     True
4  True  True     True

[ True  True  True  True  True]

False

Thanks for any help!

asked Mar 27 '15 23:03

Brandon Rhodes

1 Answer

This feels like a bug to me, but could be simply that I'm misunderstanding something. The blocks are listed in a different order:

>>> df1._data
BlockManager
Items: Index(['title', 'year', 'director'], dtype='object')
Axis 1: Int64Index([0, 1, 2, 3, 4], dtype='int64')
IntBlock: slice(1, 2, 1), 1 x 5, dtype: int64
ObjectBlock: slice(0, 4, 2), 2 x 5, dtype: object
>>> df5._data
BlockManager
Items: Index(['title', 'year', 'director'], dtype='object')
Axis 1: Int64Index([0, 1, 2, 3, 4], dtype='int64')
ObjectBlock: slice(0, 4, 2), 2 x 5, dtype: object
IntBlock: slice(1, 2, 1), 1 x 5, dtype: int64

In core/internals.py, we have the BlockManager method

def equals(self, other):
    self_axes, other_axes = self.axes, other.axes
    if len(self_axes) != len(other_axes):
        return False
    if not all (ax1.equals(ax2) for ax1, ax2 in zip(self_axes, other_axes)):
        return False
    self._consolidate_inplace()
    other._consolidate_inplace()
    return all(block.equals(oblock) for block, oblock in
               zip(self.blocks, other.blocks))

and that last all assumes that the blocks in self and other correspond. But if we add some print calls before it, we see:

>>> df1.equals(df5)
blocks self: (IntBlock: slice(1, 2, 1), 1 x 5, dtype: int64, ObjectBlock: slice(0, 4, 2), 2 x 5, dtype: object)
blocks other: (ObjectBlock: slice(0, 4, 2), 2 x 5, dtype: object, IntBlock: slice(1, 2, 1), 1 x 5, dtype: int64)
False

and so we're comparing the wrong things. I'm not sure whether this is a bug, because I don't know whether equals is meant to be this finicky. If it is, I think there's at least a doc bug, because equals should then shout that it's not meant for what you'd expect from its name and docstring.
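In the meantime, one way to sidestep the block ordering is to compare the frames one column at a time. The sketch below is only an illustration under my own assumptions: frames_match is a hypothetical helper, not part of pandas, and the two small frames are made up to stand in for df1 and df5. It relies only on Index.equals and Series.equals, so the internal block layout should not come into play.

```python
import pandas as pd


def frames_match(left, right):
    """Return True if both frames have equal labels and equal columns.

    Hypothetical helper, not part of pandas: it compares one column at a
    time, so the internal order of dtype blocks never enters the picture.
    """
    if not left.columns.equals(right.columns):
        return False
    if not left.index.equals(right.index):
        return False
    # Compare by position so duplicate column names are handled too.
    return all(
        left.iloc[:, i].equals(right.iloc[:, i])
        for i in range(left.shape[1])
    )


films = {
    "title": ["Notorious", "His Girl Friday"],
    "year": [1946, 1940],
    "director": ["Alfred Hitchcock", "Howard Hawks"],
}
df_a = pd.DataFrame(films)
# Build the same data, shuffle the columns, then restore the original order.
df_b = pd.DataFrame(films)[["director", "year", "title"]]
df_b = df_b[["title", "year", "director"]]

print(frames_match(df_a, df_b))
```

If raising an exception on mismatch is acceptable, pd.testing.assert_frame_equal(df1, df5) in recent pandas versions is another option worth trying.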

answered Oct 28 '22 11:10

DSM
