contract of pandas.DataFrame.equals

Tags: python, pandas

I have a simple test case of a function which returns a df that can potentially contain NaN. I was testing if the output and expected output were equal.

>>> output
Out[1]: 
      r   t  ts  tt  ttct
0  2048  30   0  90     1
1  4096  90   1  30     1
2     0  70   2  65     1

[3 rows x 5 columns]
>>> expected
Out[2]: 
      r   t  ts  tt  ttct
0  2048  30   0  90     1
1  4096  90   1  30     1
2     0  70   2  65     1

[3 rows x 5 columns]
>>> output == expected
Out[3]: 
      r     t    ts    tt  ttct
0  True  True  True  True  True
1  True  True  True  True  True
2  True  True  True  True  True

However, I can't simply rely on the == operator because of NaNs. I was under the impression that the appropriate way to resolve this was by using the equals method. From the documentation:

pandas.DataFrame.equals
DataFrame.equals(other)
Determines if two NDFrame objects contain the same elements. NaNs in the same location are considered equal.

Nonetheless:

>>> expected.equals(output)
Out[4]: False

A little digging around reveals the difference in the frames:

>>> output._data
Out[5]: 
BlockManager
Items: Index([u'r', u't', u'ts', u'tt', u'ttct'], dtype='object')
Axis 1: Int64Index([0, 1, 2], dtype='int64')
FloatBlock: [r], 1 x 3, dtype: float64
IntBlock: [t, ts, tt, ttct], 4 x 3, dtype: int64
>>> expected._data
Out[6]: 
BlockManager
Items: Index([u'r', u't', u'ts', u'tt', u'ttct'], dtype='object')
Axis 1: Int64Index([0, 1, 2], dtype='int64')
IntBlock: [r, t, ts, tt, ttct], 5 x 3, dtype: int64

Force that output float block to int, or force the expected int block to float, and the test passes.
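A minimal sketch that reproduces the situation, assuming a recent pandas version. The two-column frames here are made up for illustration and are not the asker's actual data; only the dtype of column r differs between them.

```python
import pandas as pd

# Illustrative data only: same values, but "r" is int64 in one frame
# and float64 in the other.
expected = pd.DataFrame({"r": [2048, 4096, 0], "t": [30, 90, 70]})
output = expected.astype({"r": "float64"})

# Element-wise comparison ignores the dtype difference.
elementwise_equal = bool((output == expected).all().all())

# equals() also requires matching dtypes, so it reports a mismatch.
strict_equal = output.equals(expected)

# Casting output to the expected dtypes makes equals() pass.
after_cast_equal = output.astype(expected.dtypes.to_dict()).equals(expected)

print(elementwise_equal)  # True
print(strict_equal)       # False
print(after_cast_equal)   # True
```

Note that casting a float column to int only works when it contains no NaN, so casting the expected frame to float is the safer direction when NaNs are possible.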

Obviously, there are different senses of equality, and the sort of test that DataFrame.equals performs could be useful in some cases. Nonetheless, the disparity between == and DataFrame.equals is frustrating to me and seems like an inconsistency. In pseudo-code, I would expect its behavior to match:

(self.index == other.index).all() \
and (self.columns == other.columns).all() \
and (self.fillna(SOME_MAGICAL_VALUE) == other.fillna(SOME_MAGICAL_VALUE)).all().all()

However, it doesn't. Am I wrong in my thinking, or is this an inconsistency in the Pandas API? Moreover, what IS the test I should be performing for my purposes, given the possible presence of NaN?

asked Oct 24 '14 at 16:10 by jwilner


1 Answer

.equals() does just what it says. It tests for exact equality among elements, positioning of NaNs (and NaTs), dtype equality, and index equality. Think of this as a df is df2 type of test, except that they don't have to actually be the same object. In other words, df.equals(df.copy()) is always True.
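A small sketch of that last point, assuming a recent pandas version and a made-up frame containing a NaN: the copy is a different object, == reports a mismatch at the NaN position, and equals() still returns True.

```python
import numpy as np
import pandas as pd

# Illustrative frame with a NaN in column "a".
df = pd.DataFrame({"a": [1.0, np.nan], "b": ["x", "y"]})
df_copy = df.copy()

same_object = df_copy is df                       # a copy is a new object
nan_under_eq = bool((df == df_copy).all().all())  # NaN != NaN element-wise
nan_under_equals = df.equals(df_copy)             # NaNs in the same place match

print(same_object)       # False
print(nan_under_eq)      # False
print(nan_under_equals)  # True
```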

Your example fails because different dtypes are not equal (they may be equivalent, though). So you can use com.array_equivalent for this, or (df == df2).all().all() if you don't have NaNs.
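For a check that ignores dtype but still matches NaNs by position, one possible sketch follows. The helper name frames_equivalent is hypothetical and not part of pandas, and the data is made up. The call to pandas.testing.assert_frame_equal with check_dtype=False is an alternative that this answer does not mention; it assumes a reasonably recent pandas version.

```python
import numpy as np
import pandas as pd


def frames_equivalent(left, right):
    """Hypothetical helper: equal in the == sense, NaNs matched by position."""
    # Labels must agree before an element-wise comparison makes sense.
    if not (left.index.equals(right.index) and left.columns.equals(right.columns)):
        return False
    both_nan = left.isna() & right.isna()
    return bool(((left == right) | both_nan).all().all())


# Illustrative data: "r" is int64 on the left and float64 on the right.
left = pd.DataFrame({"r": [2048, 4096, 0], "t": [30.0, np.nan, 70.0]})
right = pd.DataFrame({"r": [2048.0, 4096.0, 0.0], "t": [30.0, np.nan, 70.0]})

strict = left.equals(right)                  # False: dtypes differ
equivalent = frames_equivalent(left, right)  # True: values and NaN positions match

# Moving a NaN to a different position breaks equivalence.
shifted = right.copy()
shifted.loc[0, "t"] = np.nan
equivalent_shifted = frames_equivalent(left, shifted)  # False

# In a test suite, this raises AssertionError on mismatch and is silent otherwise.
pd.testing.assert_frame_equal(left, right, check_dtype=False)
```

In a unit test, assert_frame_equal has the advantage of reporting which column and values differ, whereas the helper only returns a boolean.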

This is a replacement for np.array_equal, which is broken for NaN positional detection (and object dtypes).

It is mostly used internally. That said, if you would like an enhancement for equivalence (e.g. the elements are equivalent in the == sense and NaN positions match), please open an issue on GitHub (and, even better, submit a PR!).

answered Sep 28 '22 at 18:09 by Jeff