How can Pandas DataFrames appear identical but fail equals()?

Tags: python, pandas

To confirm that I understand what Pandas df.groupby() and df.reset_index() do, I attempted to do a round-trip from a dataframe to a grouped version of the same data and back. After the round-trip the columns and rows had to be sorted again, because groupby() affects row order and reset_index() affects column order, but after two quick maneuvers to put the columns and index back in order, the dataframes look identical:

  • Same list of column names.
  • Same dtypes for every column.
  • Corresponding index values are strictly equal.
  • Corresponding data values are strictly equal.

Yet, after all of these checks succeed, df1.equals(df5) returns the astounding value False.

What difference between these dataframes is equals() uncovering that I have not yet figured out how to check for myself?

Test code:

csv_text = """\
Title,Year,Director
North by Northwest,1959,Alfred Hitchcock
Notorious,1946,Alfred Hitchcock
The Philadelphia Story,1940,George Cukor
To Catch a Thief,1955,Alfred Hitchcock
His Girl Friday,1940,Howard Hawks
"""

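import io

import pandas as pd

# Parse the CSV text above directly, so no sample.csv file is needed.
df1 = pd.read_csv(io.StringIO(csv_text))
df1.columns = [name.lower() for name in df1.columns]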
print(df1)

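# Round trip: group by director, then undo the grouping.
df2 = df1.groupby(['director', df1.index]).first()
df3 = df2.reset_index('director')
# Restore the original column order and row order.
df4 = df3[['title', 'year', 'director']]
df5 = df4.sort_index()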
print(df5)

print()
print(repr(df1.columns))
print(repr(df5.columns))
print()
print(df1.dtypes)
print(df5.dtypes)
print()
print(df1 == df5)
print()
print(df1.index == df5.index)
print()
print(df1.equals(df5))

The output that I receive when I run the script is:

                    title  year          director
0      North by Northwest  1959  Alfred Hitchcock
1               Notorious  1946  Alfred Hitchcock
2  The Philadelphia Story  1940      George Cukor
3        To Catch a Thief  1955  Alfred Hitchcock
4         His Girl Friday  1940      Howard Hawks
                    title  year          director
0      North by Northwest  1959  Alfred Hitchcock
1               Notorious  1946  Alfred Hitchcock
2  The Philadelphia Story  1940      George Cukor
3        To Catch a Thief  1955  Alfred Hitchcock
4         His Girl Friday  1940      Howard Hawks

Index(['title', 'year', 'director'], dtype='object')
Index(['title', 'year', 'director'], dtype='object')

title       object
year         int64
director    object
dtype: object
title       object
year         int64
director    object
dtype: object

  title  year director
0  True  True     True
1  True  True     True
2  True  True     True
3  True  True     True
4  True  True     True

[ True  True  True  True  True]

False

Thanks for any help!

Brandon Rhodes asked Mar 27 '15 23:03


1 Answer

This feels like a bug to me, but could be simply that I'm misunderstanding something. The blocks are listed in a different order:

>>> df1._data
BlockManager
Items: Index(['title', 'year', 'director'], dtype='object')
Axis 1: Int64Index([0, 1, 2, 3, 4], dtype='int64')
IntBlock: slice(1, 2, 1), 1 x 5, dtype: int64
ObjectBlock: slice(0, 4, 2), 2 x 5, dtype: object
>>> df5._data
BlockManager
Items: Index(['title', 'year', 'director'], dtype='object')
Axis 1: Int64Index([0, 1, 2, 3, 4], dtype='int64')
ObjectBlock: slice(0, 4, 2), 2 x 5, dtype: object
IntBlock: slice(1, 2, 1), 1 x 5, dtype: int64

In core/internals.py, we have the BlockManager method:

def equals(self, other):
    self_axes, other_axes = self.axes, other.axes
    if len(self_axes) != len(other_axes):
        return False
    if not all(ax1.equals(ax2) for ax1, ax2 in zip(self_axes, other_axes)):
        return False
    self._consolidate_inplace()
    other._consolidate_inplace()
    return all(block.equals(oblock) for block, oblock in
               zip(self.blocks, other.blocks))

and that last all assumes that the blocks in self and other correspond. But if we add some print calls before it, we see:

>>> df1.equals(df5)
blocks self: (IntBlock: slice(1, 2, 1), 1 x 5, dtype: int64, ObjectBlock: slice(0, 4, 2), 2 x 5, dtype: object)
blocks other: (ObjectBlock: slice(0, 4, 2), 2 x 5, dtype: object, IntBlock: slice(1, 2, 1), 1 x 5, dtype: int64)
False
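
The effect of pairing blocks purely by position can be reproduced with a toy model. In this sketch the tuples are illustrative stand-ins, not pandas' real block objects, and both function names are made up for the example. Sorting by (dtype, positions) is only one possible canonical order, and I'm not claiming it is what pandas does.

```python
# Toy model: each "block" is (dtype name, column positions, values).
# These tuples are illustrative stand-ins, not pandas' real Block objects.
blocks_self = [
    ("int64", (1,), [[1959, 1946]]),
    ("object", (0, 2), [["Notorious", "Charade"], ["Hitchcock", "Donen"]]),
]
# The same two blocks, listed in the opposite order.
blocks_other = [blocks_self[1], blocks_self[0]]


def positional_equal(left, right):
    """Compare blocks pairwise by position, like the zip() in equals()."""
    return len(left) == len(right) and all(a == b for a, b in zip(left, right))


def order_insensitive_equal(left, right):
    """Sort both block lists by (dtype, positions) before pairing them."""
    def key(block):
        return (block[0], block[1])

    return positional_equal(sorted(left, key=key), sorted(right, key=key))


print(positional_equal(blocks_self, blocks_other))         # False
print(order_insensitive_equal(blocks_self, blocks_other))  # True
```

The positional check fails on the very first pair, because an int64 block is compared against an object block even though both lists hold the same data.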

So we're comparing the wrong things: the blocks are paired by position, so the IntBlock of one frame is compared against the ObjectBlock of the other. I'm not sure whether this is a bug, because I'm not sure whether equals is meant to be this finicky. If it is, I think there's at least a doc bug, because equals should then shout that it's not meant to be used the way its name and docstring suggest.
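
In the meantime, a comparison that does not depend on how the blocks are laid out can be built from public methods only. This is a minimal sketch: frames_match is a hypothetical helper name, not part of pandas, and it assumes the column labels are unique so that left[col] returns a Series.

```python
import pandas as pd


def frames_match(left, right):
    """Hypothetical helper: compare labels first, then each column as a Series."""
    if not left.columns.equals(right.columns):
        return False
    if not left.index.equals(right.index):
        return False
    # Columns are looked up by label, so internal storage order is irrelevant.
    return all(left[col].equals(right[col]) for col in left.columns)


films = pd.DataFrame({
    "title": ["Notorious", "His Girl Friday"],
    "year": [1946, 1940],
    "director": ["Alfred Hitchcock", "Howard Hawks"],
})
# Reorder the columns and put them back; the helper gives the same answer
# whether or not this changes the internal layout.
shuffled = films[["director", "year", "title"]][["title", "year", "director"]]
changed = films.assign(year=[1946, 1941])

print(frames_match(films, shuffled))  # True
print(frames_match(films, changed))   # False
```

Calling frames_match(df1, df5) on the frames from the question applies the same label, index, and per-column checks that were done by hand there, but returns a single boolean.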

DSM answered Oct 28 '22 11:10