Pandas DataFrames with NaNs equality comparison

In the context of unit testing some functions, I'm trying to establish the equality of 2 DataFrames using python pandas:

ipdb> expect
                            1   2
2012-01-01 00:00:00+00:00 NaN   3
2013-05-14 12:00:00+00:00   3 NaN

ipdb> df
identifier                  1   2
timestamp
2012-01-01 00:00:00+00:00 NaN   3
2013-05-14 12:00:00+00:00   3 NaN

ipdb> df[1][0]
nan

ipdb> df[1][0], expect[1][0]
(nan, nan)

ipdb> df[1][0] == expect[1][0]
False

ipdb> df[1][1] == expect[1][1]
True

ipdb> type(df[1][0])
<type 'numpy.float64'>

ipdb> type(expect[1][0])
<type 'numpy.float64'>

ipdb> (list(df[1]), list(expect[1]))
([nan, 3.0], [nan, 3.0])

ipdb> df1, df2 = (list(df[1]), list(expect[1])) ;; df1 == df2
False

Given that I'm trying to test all of expect against all of df, including the NaN positions, what am I doing wrong?

What is the simplest way to compare equality of Series/DataFrames including NaNs?

Steve Pike asked Oct 11 '13 16:10



2 Answers

You can use assert_frame_equal with check_names=False (so that the index/column names are not checked); it raises an AssertionError if the frames are not equal:

In [11]: from pandas.testing import assert_frame_equal

In [12]: assert_frame_equal(df, expected, check_names=False)

You can wrap this in a function with something like:

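from pandas.testing import assert_frame_equal

def frames_equal(df, expected):
    # frames_equal is an illustrative name; returns a bool instead of raising
    try:
        assert_frame_equal(df, expected, check_names=False)
        return True
    except AssertionError:
        return False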

More recent versions of pandas provide this functionality directly as the .equals method:

df.equals(expected) 
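As a rough illustration of both approaches, here is a minimal sketch using a small frame modelled on the one in the question. The data values and the use of rename_axis to set the index/column names are assumptions made for the example, not taken from the original code.

```python
import numpy as np
import pandas as pd
from pandas.testing import assert_frame_equal

# Illustrative data modelled on the question; the exact values are assumptions.
idx = pd.to_datetime(["2012-01-01 00:00:00", "2013-05-14 12:00:00"], utc=True)
expect = pd.DataFrame({1: [np.nan, 3.0], 2: [3.0, np.nan]}, index=idx)

# Same values as expect, but with names on the index and the columns.
df = expect.rename_axis(index="timestamp", columns="identifier")

# Elementwise == treats NaN as unequal to itself, so reducing it gives False.
naive_equal = bool((df == expect).all().all())

# assert_frame_equal treats NaNs in matching positions as equal. It raises
# AssertionError on a mismatch and returns None otherwise.
assert_frame_equal(df, expect, check_names=False)

# .equals also treats NaNs in matching positions as equal.
equals_result = df.equals(expect)

print(naive_equal, equals_result)
```

Without check_names=False, the assert_frame_equal call above would be expected to fail here, because only df carries the names "timestamp" and "identifier".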
Andy Hayden answered Sep 19 '22 03:09


One of the properties of NaN is that NaN != NaN is True, which is why df[1][0] == expect[1][0] returns False even though both values are NaN.
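A quick way to see this for yourself is the short sketch below. It uses only plain Python floats and NumPy scalars; the variable names are just for illustration.

```python
import math
import numpy as np

nan = float("nan")

print(nan == nan)        # NaN never compares equal to itself
print(nan != nan)        # so != is the comparison that succeeds
print(math.isnan(nan))   # the reliable way to test a scalar for NaN

# NumPy float64 NaNs (the type shown in the question) behave the same way.
a, b = np.float64("nan"), np.float64("nan")
print(a == b)

# Lists compare element by element, so two lists holding distinct NaN
# objects compare unequal, as with df1 == df2 in the question.
print([a, 3.0] == [b, 3.0])
```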

Check out this answer for a nice way to do this using numexpr.

(a == b) | ((a != a) & (b != b)) 

says this (in pseudocode):

a == b or (isnan(a) and isnan(b)) 

So, either a equals b, or both a and b are NaN.
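Applied to whole DataFrames, the expression might look like the sketch below. This is plain pandas rather than numexpr, and nan_aware_eq and the sample frames are hypothetical names and data chosen for the example.

```python
import numpy as np
import pandas as pd

def nan_aware_eq(a, b):
    """Elementwise mask: True where a == b or where both a and b are NaN.

    nan_aware_eq is a hypothetical helper name used for illustration.
    """
    return (a == b) | ((a != a) & (b != b))

left = pd.DataFrame({"x": [np.nan, 3.0], "y": [3.0, np.nan]})
right = left.copy()
other = pd.DataFrame({"x": [np.nan, 4.0], "y": [3.0, np.nan]})

# Reduce the boolean mask to a single bool for the whole frame.
same = bool(nan_aware_eq(left, right).values.all())
different = bool(nan_aware_eq(left, other).values.all())

print(same, different)
```

Note that this compares values only and assumes both frames already share the same index and column labels.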

If you have small frames then assert_frame_equal will be okay. However, for large frames (10M rows) assert_frame_equal is pretty much useless: it was taking so long that I had to interrupt it.

In [1]: df = DataFrame(rand(1e7, 15))

In [2]: df = df[df > 0.5]

In [3]: df2 = df.copy()

In [4]: df
Out[4]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000000 entries, 0 to 9999999
Columns: 15 entries, 0 to 14
dtypes: float64(15)

In [5]: timeit (df == df2) | ((df != df) & (df2 != df2))
1 loops, best of 3: 598 ms per loop

Timing the (presumably) desired single bool that indicates whether the two DataFrames are equal:

In [9]: timeit ((df == df2) | ((df != df) & (df2 != df2))).values.all()
1 loops, best of 3: 687 ms per loop
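If you want to repeat the measurement on a current install, something like the following self-contained sketch should work. It is not the original benchmark: the frame is deliberately much smaller than 10M rows (an assumption so that it finishes quickly), the dimensions are passed as integers because current NumPy does not accept a float such as 1e7 as a shape, and mask_equal is a hypothetical helper name. The timings depend on the machine and library versions, so none are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

# Assumed sizes, far smaller than the 10M rows used in the session above.
n_rows, n_cols = 100_000, 15

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((n_rows, n_cols)))
df = df[df > 0.5]        # entries that are not > 0.5 become NaN
df2 = df.copy()

def mask_equal(a, b):
    """Single bool: True if a and b match everywhere, counting NaN == NaN."""
    return bool(((a == b) | ((a != a) & (b != b))).values.all())

# Time a handful of repetitions of each approach.
t_mask = timeit.timeit(lambda: mask_equal(df, df2), number=5)
t_equals = timeit.timeit(lambda: df.equals(df2), number=5)

print(f"mask expression: {t_mask:.4f}s  df.equals: {t_equals:.4f}s")
```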
Phillip Cloud answered Sep 19 '22 03:09