To confirm that I understand what Pandas df.groupby()
and df.reset_index()
do, I attempted to do a round-trip from a dataframe to a grouped version of the same data and back. After the round-trip the columns and rows had to be sorted again, because groupby()
affects row order and reset_index()
affects column order, but after two quick maneuvers to put the columns and index back in order, the dataframes look identical:
Yet, after all of these checks succeed, df1.equals(df5)
returns the astounding value False
.
What difference between these dataframes is equals()
uncovering that I have not yet figured out how to check for myself?
Test code:
csv_text = """\
Title,Year,Director
North by Northwest,1959,Alfred Hitchcock
Notorious,1946,Alfred Hitchcock
The Philadelphia Story,1940,George Cukor
To Catch a Thief,1955,Alfred Hitchcock
His Girl Friday,1940,Howard Hawks
"""
import pandas as pd
df1 = pd.read_csv('sample.csv')
df1.columns = map(str.lower, df1.columns)
print(df1)
df2 = df1.groupby(['director', df1.index]).first()
df3 = df2.reset_index('director')
df4 = df3[['title', 'year', 'director']]
df5 = df4.sort_index()
print(df5)
print()
print(repr(df1.columns))
print(repr(df5.columns))
print()
print(df1.dtypes)
print(df5.dtypes)
print()
print(df1 == df5)
print()
print(df1.index == df5.index)
print()
print(df1.equals(df5))
The output that I receive when I run the script is:
title year director
0 North by Northwest 1959 Alfred Hitchcock
1 Notorious 1946 Alfred Hitchcock
2 The Philadelphia Story 1940 George Cukor
3 To Catch a Thief 1955 Alfred Hitchcock
4 His Girl Friday 1940 Howard Hawks
title year director
0 North by Northwest 1959 Alfred Hitchcock
1 Notorious 1946 Alfred Hitchcock
2 The Philadelphia Story 1940 George Cukor
3 To Catch a Thief 1955 Alfred Hitchcock
4 His Girl Friday 1940 Howard Hawks
Index(['title', 'year', 'director'], dtype='object')
Index(['title', 'year', 'director'], dtype='object')
title object
year int64
director object
dtype: object
title object
year int64
director object
dtype: object
title year director
0 True True True
1 True True True
2 True True True
3 True True True
4 True True True
[ True True True True True]
False
Thanks for any help!
DataFrame - equals() function The equals() function is used to test whether two objects contain the same elements. This function allows two Series or DataFrames to be compared against each other to see if they have the same shape and elements.
equals() function is used to determine if two dataframe object in consideration are equal or not. Unlike dataframe. eq() method, the result of the operation is a scalar boolean value indicating if the dataframe objects are equal or not.
Overview. The compare method in pandas shows the differences between two DataFrames. It compares two data frames, row-wise and column-wise, and presents the differences side by side. The compare method can only compare DataFrames of the same shape, with exact dimensions and identical row and column labels.
This feels like a bug to me, but could be simply that I'm misunderstanding something. The blocks are listed in a different order:
>>> df1._data
BlockManager
Items: Index(['title', 'year', 'director'], dtype='object')
Axis 1: Int64Index([0, 1, 2, 3, 4], dtype='int64')
IntBlock: slice(1, 2, 1), 1 x 5, dtype: int64
ObjectBlock: slice(0, 4, 2), 2 x 5, dtype: object
>>> df5._data
BlockManager
Items: Index(['title', 'year', 'director'], dtype='object')
Axis 1: Int64Index([0, 1, 2, 3, 4], dtype='int64')
ObjectBlock: slice(0, 4, 2), 2 x 5, dtype: object
IntBlock: slice(1, 2, 1), 1 x 5, dtype: int64
In core/internals.py
, we have the BlockManager
method
def equals(self, other):
self_axes, other_axes = self.axes, other.axes
if len(self_axes) != len(other_axes):
return False
if not all (ax1.equals(ax2) for ax1, ax2 in zip(self_axes, other_axes)):
return False
self._consolidate_inplace()
other._consolidate_inplace()
return all(block.equals(oblock) for block, oblock in
zip(self.blocks, other.blocks))
and that last all
assumes that the blocks in self
and other
correspond. But if we add some print
calls before it, we see:
>>> df1.equals(df5)
blocks self: (IntBlock: slice(1, 2, 1), 1 x 5, dtype: int64, ObjectBlock: slice(0, 4, 2), 2 x 5, dtype: object)
blocks other: (ObjectBlock: slice(0, 4, 2), 2 x 5, dtype: object, IntBlock: slice(1, 2, 1), 1 x 5, dtype: int64)
False
and so we're comparing the wrong things. The reason I'm not sure whether or not this is a bug is because I'm not sure whether equals
is meant to be this finicky or not. If so, I think there's a doc bug, at least, because equals
should then shout that it's not meant to be used for what you might think it would be from the name and the docstring.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With