
Why do these dtypes compare equal but hash different?

Tags: python, numpy

In [30]: import numpy as np

In [31]: d = np.dtype(np.float64)

In [32]: d
Out[32]: dtype('float64')

In [33]: d == np.float64
Out[33]: True

In [34]: hash(np.float64)
Out[34]: -9223372036575774449

In [35]: hash(d)
Out[35]: 880835502155208439

Why do these dtypes compare equal but hash different?

Note that Python does promise that:

The only required property is that objects which compare equal have the same hash value…

My workaround for this problem is to call np.dtype on everything, after which hash values and comparisons are consistent.
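A minimal sketch of that workaround, assuming the goal is to use dtype-like values as dictionary keys. The helper name `as_dtype_key` is hypothetical and not part of NumPy; it only wraps `np.dtype`.

```python
import numpy as np

def as_dtype_key(spec):
    """Hypothetical helper: coerce any dtype-like value to a numpy.dtype."""
    return np.dtype(spec)

# Three spellings of the same dtype.
specs = [np.float64, "float64", np.dtype(np.float64)]
keys = [as_dtype_key(s) for s in specs]

# After normalization, equality and hashing agree with each other.
all_equal = all(k == keys[0] for k in keys)
all_same_hash = len({hash(k) for k in keys}) == 1

# So the normalized values behave as a single dictionary key.
sizes = {as_dtype_key(np.float64): 8}
looked_up = sizes[as_dtype_key("float64")]
```

The point of normalizing at the boundary is that every key in the dictionary is then a numpy.dtype, so the mixed-type comparison never comes into play.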

asked Feb 09 '16 13:02 by Neil G


3 Answers

As tttthomasssss notes, the type (class) for np.float64 and d are different. They are different kinds of things:

In [435]: type(np.float64)
Out[435]: type

A type of type means that np.float64 is a class, so calling it constructs an instance:

In [436]: np.float64(0)
Out[436]: 0.0

In [437]: type(_)
Out[437]: numpy.float64

This creates a numeric object whose type is np.float64. numpy uses a lot of compiled code, and its ndarray has its own __new__, so I wouldn't be surprised by oddities, but here np.float64 behaves like an ordinary class.

In [438]: np.float64.__hash__??
Type:       wrapper_descriptor
String Form:<slot wrapper '__hash__' of 'float' objects>
Docstring:  x.__hash__() <==> hash(x)

I first thought this would be what hash(np.float64) calls, but it is actually the hash for an object of that type, e.g. hash(np.float64(0)). hash(np.float64) itself just uses the default type.__hash__ method.

Moving on to the dtype:

In [439]: d=np.dtype(np.float64)

In [440]: type(d)
Out[440]: numpy.dtype

d is not a function or class:

In [441]: d(0)
...
TypeError: 'numpy.dtype' object is not callable

In [442]: d.__hash__??
Type:       method-wrapper
String Form:<method-wrapper '__hash__' of numpy.dtype object at 0xb60f8a60>
Docstring:  x.__hash__() <==> hash(x)

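This output alone doesn't reveal where __hash__ is defined. np.dtype does not simply fall back on object's identity-based hash, though: separately created but equal dtype instances hash alike, which requires a dtype-specific __hash__.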

Further illustrating the difference between float64 and d, look at the class inheritance stack

In [443]: np.float64.__mro__
Out[443]: 
(numpy.float64,
 numpy.floating,
 numpy.inexact,
 numpy.number,
 numpy.generic,
 float,
 object)

In [444]: d.__mro__
...
AttributeError: 'numpy.dtype' object has no attribute '__mro__'

In [445]: np.dtype.__mro__
Out[445]: (numpy.dtype, object)

So np.float64 doesn't define a hash for its instances either; it just inherits the one from float. d doesn't have an __mro__ because it's an instance, not a class.
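Reading the MRO by eye can be replaced by a small lookup. The following is a rough sketch, with `defining_class` being a hypothetical helper name, that walks `__mro__` and reports which class actually holds an attribute in its own namespace. The printed results depend on the installed NumPy version, so no expected output is given.

```python
import numpy as np

def defining_class(cls, name):
    """Return the first class in cls.__mro__ whose own namespace holds name."""
    for klass in cls.__mro__:
        if name in vars(klass):
            return klass
    return None

# Hash used for instances such as np.float64(0).
print(defining_class(np.float64, "__hash__"))

# Hash used for dtype instances such as np.dtype(np.float64).
print(defining_class(np.dtype, "__hash__"))

# Hash used for the class object np.float64 itself: look it up on its metaclass.
print(defining_class(type(np.float64), "__hash__"))
```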

numpy has enough compiled code, and a long history of its own, that you can't count on Python documentation always applying.

np.dtype and np.float64 evidently have __eq__ methods that allow them to be compared with each other, but the numpy developers did not make sure that the __hash__ methods comply, most likely because they don't need to use either as a dictionary key.

I've never seen code like:

In [453]: dd={np.float64:12,d:34}

In [454]: dd
Out[454]: {dtype('float64'): 34, numpy.float64: 12}

In [455]: dd[np.float64]
Out[455]: 12

In [456]: dd[d]
Out[456]: 34

answered Nov 04 '22 11:11 by hpaulj


They shouldn't behave this way, but __eq__ and __hash__ for numpy.dtype objects are broken on an essentially unfixable design level. I'll be pulling heavily from njsmith's comments on a dtype-related bug report for this answer.

np.float64 isn't actually a dtype. It's a type, in the ordinary sense of the Python type system. Specifically, if you retrieve a scalar from an array of float64 dtype, np.float64 is the type of the resulting scalar.

np.dtype(np.float64) is a dtype, an instance of numpy.dtype. dtypes are how NumPy records the structure of the contents of a NumPy array. They are particularly important for structured arrays, which can have very complex dtypes. While ordinary Python types could have filled much of the role of dtypes, creating new types on the fly for new structured arrays would be highly awkward, and it would probably have been impossible in the days before type-class unification.

numpy.dtype implements __eq__ basically like this:

def __eq__(self, other):
    if isinstance(other, numpy.dtype):
        return regular_comparison(self, other)
    return self == numpy.dtype(other)

which is pretty broken. Among other problems, it's not transitive, it raises TypeError when it should return NotImplemented, and its output is really bizarre at times because of how dtype coercion works:

>>> x = numpy.dtype(numpy.float64)
>>> x == None
True
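The lack of transitivity can be seen without relying on the None case. A short sketch, assuming only that np.dtype accepts the type object and the strings "float64" and "f8" as spellings of the same dtype:

```python
import numpy as np

d = np.dtype(np.float64)

# Each right-hand side is coerced with np.dtype(...) before comparing.
d_eq_type = (d == np.float64)
d_eq_name = (d == "float64")
d_eq_code = (d == "f8")

# But the things d compares equal to are not equal to one another.
name_eq_code = ("float64" == "f8")
name_eq_type = ("float64" == np.float64)
```

Here d equals all three values, while the two strings differ from each other and from the type object, so equality through d does not carry over.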

numpy.dtype.__hash__ isn't any better. It makes no attempt to be consistent with the __hash__ methods of all the other types numpy.dtype.__eq__ accepts (and with so many incompatible types to deal with, how could it?). Heck, it shouldn't even exist, because dtype objects are mutable! Not just mutable like modules or file objects, where it's okay because __eq__ and __hash__ work by identity. dtype objects are mutable in ways that will actually change their hash value:

>>> x = numpy.dtype([('f1', float)])
>>> hash(x)
-405377605
>>> x.names = ['f2']
>>> hash(x)
1908240630

When you try to compare d == np.float64, d.__eq__ builds a dtype out of np.float64 and finds that d == np.dtype(np.float64) is True. When you take their hashes, though, np.float64 uses the regular (identity-based) hash for type objects and d uses the hash for dtype objects. Normally, equal objects of different types should have equal hashes, but the dtype implementation doesn't care about that.
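To illustrate why the mismatch matters, here is a pure-Python sketch. The class `Label` is a hypothetical stand-in and not NumPy code: its __eq__ accepts strings the way dtype's accepts type objects, while its __hash__ ignores that.

```python
class Label:
    """Hypothetical stand-in for a dtype-like object; not NumPy code."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        if isinstance(other, Label):
            return self.name == other.name
        if isinstance(other, str):
            # Cross-type equality, similar in spirit to d == np.float64.
            return self.name == other
        return NotImplemented

    def __hash__(self):
        # Makes no attempt to match hash(self.name).
        return hash(("Label", self.name))

a = Label("float64")
equal = (a == "float64")

# Dictionaries compare hashes before calling __eq__, so the lookup misses.
lookup = {a: 1}
found = "float64" in lookup
```

The two objects compare equal, yet the string does not find the entry stored under the Label key. The same thing happens with d and np.float64 in a dict or set.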

Unfortunately, it's impossible to fix the problems with dtype __eq__ and __hash__ without breaking APIs people are relying on. People are counting on things like x.dtype == 'float64' or x.dtype == np.float64, and fixing dtypes would break that.

answered Nov 04 '22 10:11 by user2357112 supports Monica


They are not the same thing: np.float64 is a type, while d is an instance of numpy.dtype, so they hash to different values. However, all dtype instances created the same way as d will hash to the same value because they are equal (which of course does not necessarily mean they point to the same memory location).

Edit:

Given your code above you can try the following:

In [72]: type(d)
Out[72]: numpy.dtype

In [74]: type(np.float64)
Out[74]: type

which shows you that the two are of different types, so nothing forces them to hash to the same value. That different instances of numpy.dtype hash to the same value can be shown by the following example:

In [77]: import copy
In [78]: dd = copy.deepcopy(d) # Try copying

In [79]: dd
Out[79]: dtype('float64')

In [80]: hash(dd)
Out[80]: -6584369718629170405

In [81]: hash(d) # original d
Out[81]: -6584369718629170405

In [82]: ddd = np.dtype(np.float64) # new instance
In [83]: hash(ddd)
Out[83]: -6584369718629170405

# If using CPython, id returns the address in memory (see: https://docs.python.org/3/library/functions.html#id)
In [84]: id(ddd)
Out[84]: 4376165768

In [85]: id(dd)
Out[85]: 4459249168

In [86]: id(d)
Out[86]: 4376165768

It's nice to see that ddd (the instance created the same way as d) and d itself share the same object in memory, but dd (the copied object) uses a different address.

The equality checks evaluate as you would expect, given the hashes above:

In [87]: dd == np.float64
Out[87]: True
In [88]: d == np.float64
Out[88]: True
In [89]: ddd == np.float64
Out[89]: True
In [90]: d == dd
Out[90]: True
In [91]: d == ddd
Out[91]: True
In [92]: dd == ddd
Out[92]: True

answered Nov 04 '22 11:11 by tttthomasssss