I need to be able to store a numpy array in a dict for caching purposes. Hash speed is important. The array represents indices, so while the actual identity of the object is not important, the value is. Mutability is not a concern, as I'm only interested in the current value. What should I hash in order to store it in a dict?
My current approach is to use str(arr.data), which is faster than md5 in my testing.
I've incorporated some examples from the answers to get an idea of relative times:
    In [121]: %timeit hash(str(y))
    10000 loops, best of 3: 68.7 us per loop

    In [122]: %timeit hash(y.tostring())
    1000000 loops, best of 3: 383 ns per loop

    In [123]: %timeit hash(str(y.data))
    1000000 loops, best of 3: 543 ns per loop

    In [124]: %timeit y.flags.writeable = False; hash(y.data)
    1000000 loops, best of 3: 1.15 us per loop

    In [125]: %timeit hash((b*y).sum())
    100000 loops, best of 3: 8.12 us per loop
It would appear that for this particular use case (small arrays of indices), arr.tostring() offers the best performance. While hashing the read-only buffer is fast on its own, the overhead of setting the writeable flag actually makes it slower.
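Putting that conclusion into practice, a cache keyed on the array's raw bytes might look like the sketch below (the names cached_lookup and expensive_computation are illustrative, not from the question; tostring() is spelled tobytes() in modern NumPy):

    import numpy as np

    _cache = {}

    def expensive_computation(idx):
        # Stand-in for whatever result is actually being cached.
        return idx.sum()

    def cached_lookup(idx):
        # tostring() copies the raw buffer, so arrays with equal
        # values map to the same key regardless of object identity.
        key = idx.tostring()
        if key not in _cache:
            _cache[key] = expensive_computation(idx)
        return _cache[key]

    result = cached_lookup(np.array([1, 5, 7]))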
You can simply hash the underlying buffer if you make it read-only:
    >>> a = random.randint(10, 100, 100000)
    >>> a.flags.writeable = False
    >>> %timeit hash(a.data)
    100 loops, best of 3: 2.01 ms per loop
    >>> %timeit hash(a.tostring())
    100 loops, best of 3: 2.28 ms per loop
For very large arrays, hash(str(a)) is a lot faster, but then it only takes a small part of the array into account.
    >>> %timeit hash(str(a))
    10000 loops, best of 3: 55.5 us per loop
    >>> str(a)
    '[63 30 33 ..., 96 25 60]'
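To see why that matters: str() elides the middle of large arrays, so two arrays that differ only there stringify, and therefore hash, identically. A small demonstration, assuming default print options:

    >>> import numpy as np
    >>> a = np.zeros(100000, dtype=np.int64)
    >>> b = a.copy()
    >>> b[50000] = 1              # differs only in the elided middle
    >>> hash(str(a)) == hash(str(b))
    True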
You can try xxhash via its Python binding. For large arrays this is much faster than hash(x.tostring()).
Example IPython session:
    >>> import xxhash
    >>> import numpy
    >>> x = numpy.random.rand(1024 * 1024 * 16)
    >>> h = xxhash.xxh64()
    >>> %timeit hash(x.tostring())
    1 loops, best of 3: 208 ms per loop
    >>> %timeit h.update(x); h.intdigest(); h.reset()
    100 loops, best of 3: 10.2 ms per loop
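For a single array, you can also pass the buffer at construction time instead of managing update()/reset() state; a minimal one-shot sketch using the same x as above:

    >>> xxhash.xxh64(x.tostring()).intdigest()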
And by the way, on various blogs and answers posted to Stack Overflow, you'll see people using sha1 or md5 as hash functions. For performance reasons this is usually not acceptable, as those "secure" hash functions are rather slow. They're useful only if hash collision is one of the top concerns.
Nevertheless, hash collisions happen all the time. And if all you need is to implement __hash__ for data-array objects so that they can be used as keys in Python dictionaries or sets, I think it's better to concentrate on the speed of __hash__ itself and let Python handle the hash collisions [1].
[1] You may need to override __eq__ too, to help Python manage hash collisions. You would want __eq__ to return a boolean, rather than an array of booleans as numpy does.
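For completeness, here is a minimal sketch of such a wrapper (the class name HashableArray is illustrative): __hash__ hashes the raw bytes once, and __eq__ uses numpy.array_equal, which returns a single boolean.

    import numpy as np

    class HashableArray:
        def __init__(self, arr):
            self._arr = arr
            # Hash the raw bytes once; assumes the array is not mutated afterwards.
            self._hash = hash(arr.tostring())

        def __hash__(self):
            return self._hash

        def __eq__(self, other):
            # A single bool, not an element-wise boolean array.
            return (isinstance(other, HashableArray)
                    and np.array_equal(self._arr, other._arr))

    d = {HashableArray(np.array([1, 2, 3])): "cached value"}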