Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to hash a large object (dataset) in Python?

People also ask

What does __ hash __ do in Python?

The hash() function accepts an object and returns the hash value as an integer. When you pass an object to the hash() function, Python will execute the __hash__ special method of the object. By default, the __hash__ uses the object's identity and the __eq__ returns True if two objects are the same.

What can be hashed in Python?

What is Hash Method in Python? Hash method in Python is a module that is used to return the hash value of an object. In programming, the hash method is used to return integer values that are used to compare dictionary keys using a dictionary look up feature.

How do you create a hash file in Python?

Source Code to Find Hash Hash functions are available in the hashlib module. We loop till the end of the file using a while loop. On reaching the end, we get empty bytes object. In each iteration, we only read 1024 bytes (this value can be changed according to our wish) from the file and update the hashing function.


Thanks to John Montgomery I think I have found a solution, and I think it has less overhead than converting every number in possibly huge arrays to strings:

I can create a byte-view of the arrays and use these to update the hash. And somehow this seems to give the same digest as directly updating using the array:

>>> import hashlib
>>> import numpy
>>> a = numpy.random.rand(10, 100)
>>> b = a.view(numpy.uint8)
>>> print a.dtype, b.dtype # a and b have a different data type
float64 uint8
>>> hashlib.sha1(a).hexdigest() # byte view sha1
'794de7b1316b38d989a9040e6e26b9256ca3b5eb'
>>> hashlib.sha1(b).hexdigest() # array sha1
'794de7b1316b38d989a9040e6e26b9256ca3b5eb'

What's the format of the data in the arrays? Couldn't you just iterate through the arrays, convert them into a string (via some reproducible means) and then feed that into your hash via update?

e.g.

import hashlib
m = hashlib.md5() # or sha1 etc
for value in array: # array contains the data
    m.update(str(value))

Don't forget though that numpy arrays won't provide __hash__() because they are mutable. So be careful not to modify the arrays after your calculated your hash (as it will no longer be the same).


There is a package for memoizing functions that use numpy arrays as inputs joblib. Found from this question.


Here is how I do it in jug (git HEAD at the time of this answer):

e = some_array_object
M = hashlib.md5()
M.update('np.ndarray')
M.update(pickle.dumps(e.dtype))
M.update(pickle.dumps(e.shape))
try:
    buffer = e.data
    M.update(buffer)
except:
    M.update(e.copy().data)

The reason is that e.data is only available for some arrays (contiguous arrays). Same thing with a.view(np.uint8) (which fails with a non-descriptive type error if the array is not contiguous).