In the context of unit testing some functions, I'm trying to establish the equality of 2 DataFrames using python pandas:
ipdb> expect
                            1   2
2012-01-01 00:00:00+00:00 NaN   3
2013-05-14 12:00:00+00:00   3 NaN

ipdb> df
identifier                  1   2
timestamp
2012-01-01 00:00:00+00:00 NaN   3
2013-05-14 12:00:00+00:00   3 NaN

ipdb> df[1][0]
nan

ipdb> df[1][0], expect[1][0]
(nan, nan)

ipdb> df[1][0] == expect[1][0]
False

ipdb> df[1][1] == expect[1][1]
True

ipdb> type(df[1][0])
<type 'numpy.float64'>

ipdb> type(expect[1][0])
<type 'numpy.float64'>

ipdb> (list(df[1]), list(expect[1]))
([nan, 3.0], [nan, 3.0])

ipdb> df1, df2 = (list(df[1]), list(expect[1])) ;; df1 == df2
False
Given that I'm trying to test the whole of expect against the whole of df, including NaN positions, what am I doing wrong?
What is the simplest way to compare equality of Series/DataFrames including NaNs?
Pandas DataFrame equals() function: equals() tests whether two objects contain the same elements. It compares two Series or DataFrames against each other to see whether they have the same shape and elements. NaNs in the same location are considered equal.
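As a small illustration of the behaviour described above, here is a sketch using a made-up two-column frame (the column names "a" and "b" and the values are assumptions for the example, not the asker's data). It also shows one caveat worth knowing: equals() requires corresponding columns to have the same dtype.

```python
import numpy as np
import pandas as pd

# Illustrative data only: NaNs sit in the same positions in both frames.
df = pd.DataFrame({"a": [np.nan, 3.0], "b": [3.0, np.nan]})
expected = df.copy()

# Elementwise == treats NaN as unequal to NaN, so this reduction is False.
elementwise_all = bool((df == expected).values.all())

# equals() treats NaNs in the same location as equal.
same = df.equals(expected)

# Caveat: columns must share a dtype, so int64 vs float64 compares unequal.
ints = pd.DataFrame({"a": [1, 2]})
floats = pd.DataFrame({"a": [1.0, 2.0]})
dtype_sensitive = ints.equals(floats)

print(elementwise_all, same, dtype_sensitive)
```

If the dtype strictness is unwanted in a test, casting both frames to a common dtype before calling equals() is one option.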
The compare method in pandas shows the differences between two DataFrames. It compares them row-wise and column-wise and presents the differences side by side. It can only compare DataFrames of the same shape with identical row and column labels.
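A minimal sketch of compare, assuming pandas 1.1 or later (where DataFrame.compare is available) and using invented data. It can be handy in a failing test to see which cells differ rather than just getting a boolean.

```python
import numpy as np
import pandas as pd

# Illustrative frames: identical except for one cell in column "a".
df = pd.DataFrame({"a": [np.nan, 3.0], "b": [3.0, np.nan]})
expected = pd.DataFrame({"a": [np.nan, 4.0], "b": [3.0, np.nan]})

# Only differing rows/columns are kept; values appear under "self" and "other".
diff = df.compare(expected)
print(diff)

# Frames that match (including NaNs in the same positions) give an empty result.
no_diff = df.compare(df.copy())
print(no_diff.empty)
```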
To check for null values in a pandas DataFrame, use the isnull() function. It returns a DataFrame of Boolean values that are True wherever the original value is NaN.
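For instance, a short sketch with made-up data showing the Boolean mask that isnull() returns, and how two such masks can be compared to check that NaNs occupy the same positions (this checks positions only, not the non-NaN values):

```python
import numpy as np
import pandas as pd

# Illustrative data only.
df = pd.DataFrame({"a": [np.nan, 3.0], "b": [3.0, np.nan]})
expected = pd.DataFrame({"a": [np.nan, 5.0], "b": [3.0, np.nan]})

# True where the value is NaN, False elsewhere.
mask = df.isnull()
print(mask)

# The masks are plain Boolean frames, so they can be compared directly.
same_nan_positions = mask.equals(expected.isnull())
print(same_nan_positions)
```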
You can use assert_frame_equal with check_names=False (so as not to check the index/columns names), which will raise an AssertionError if the frames are not equal:
In [11]: from pandas.testing import assert_frame_equal

In [12]: assert_frame_equal(df, expected, check_names=False)
You can wrap this in a function with something like:
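from pandas.testing import assert_frame_equal

# frames_equal is an illustrative name for the wrapper
def frames_equal(df, expected):
    try:
        assert_frame_equal(df, expected, check_names=False)
        return True
    except AssertionError:
        return False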
In more recent pandas this functionality has been added as .equals:
df.equals(expected)
One of the properties of NaN is that NaN != NaN is True.
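A quick plain-Python sketch of that property, and of why the list comparison in the question comes out False. The lists here are stand-ins built with float("nan"), not the asker's actual columns.

```python
nan = float("nan")

# NaN never compares equal to anything, itself included.
nan_not_equal_to_itself = nan != nan
nan_equal_to_itself = nan == nan

# Two lists holding separately created NaNs therefore compare unequal,
# even though they print identically.
left = [float("nan"), 3.0]
right = [float("nan"), 3.0]
lists_equal = left == right

# The self-inequality doubles as a NaN test: x != x holds only for NaN.
nan_flags = [x != x for x in left]

print(nan_not_equal_to_itself, nan_equal_to_itself, lists_equal, nan_flags)
```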
Check out this answer for a nice way to do this using numexpr.
(a == b) | ((a != a) & (b != b))
says this (in pseudocode):
a == b or (isnan(a) and isnan(b))
So, either a equals b, or both a and b are NaN.
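The expression above can be wrapped into a small helper. This is only a sketch: the name frames_equal_with_nan is hypothetical, and the label check is an added assumption, included because elementwise == on DataFrames raises when the labels differ.

```python
import numpy as np
import pandas as pd


def frames_equal_with_nan(a: pd.DataFrame, b: pd.DataFrame) -> bool:
    """Hypothetical helper: True if a and b match cell by cell, NaNs included."""
    # Elementwise comparison needs identically labeled frames.
    if not (a.index.equals(b.index) and a.columns.equals(b.columns)):
        return False
    # Either the cells are equal, or both cells are NaN (x != x only for NaN).
    cellwise = (a == b) | ((a != a) & (b != b))
    return bool(cellwise.values.all())


# Illustrative data only.
df = pd.DataFrame({1: [np.nan, 3.0], 2: [3.0, np.nan]})
expect = df.copy()
changed = df.copy()
changed.iloc[1, 0] = 4.0

print(frames_equal_with_nan(df, expect))
print(frames_equal_with_nan(df, changed))
```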
If you have small frames then assert_frame_equal will be okay. However, for large frames (10M rows) assert_frame_equal is pretty much useless. I had to interrupt it because it was taking so long.
In [1]: df = DataFrame(rand(1e7, 15))

In [2]: df = df[df > 0.5]

In [3]: df2 = df.copy()

In [4]: df
Out[4]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000000 entries, 0 to 9999999
Columns: 15 entries, 0 to 14
dtypes: float64(15)

In [5]: timeit (df == df2) | ((df != df) & (df2 != df2))
1 loops, best of 3: 598 ms per loop
Here is the timeit of the (presumably) desired single bool indicating whether the two DataFrames are equal:
In [9]: timeit ((df == df2) | ((df != df) & (df2 != df2))).values.all()
1 loops, best of 3: 687 ms per loop