
Strange Python hashlib.md5 behavior: different hash each time

Tags: python, hash

I've run into some really strange behavior while trying to calculate the md5 hash of a string. The returned hash is always wrong (and different each time) if I pass a string that is the result of concatenation. The only way I've found to get the real hash is to pass a string that wasn't modified in any way after creation.

Python 2.7.13 (v2.7.13:a06454b1afa1, Dec 17 2016, 20:42:59) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import hashlib
>>> m = hashlib.md5() 
>>> a1 = "stack"
>>> a2 = "overflow"
>>> a3 = a1 + a2
>>> a4 = str(a1 + a2)
>>> m.update("stackoverflow")
>>> m.hexdigest()
'73868cb1848a216984dca1b6b0ee37bc'  # actual hash
>>> m.update(a1 + a2)
>>> m.hexdigest()
'458b7358b9e0c3f561957b96e543c5a8'
>>> m.update(a3)
>>> m.hexdigest()
'65b0e62d4ff2d91e111ecc8f27f0e8f5'
>>> m.update(a4)
>>> m.hexdigest()
'60c3ae3dd9a2095340b2e024194bad3c'
>>> m.update(a1 + a2)
>>> m.hexdigest()
'acd4e14145d34dcb10af785badf8e73e'
>>> m.update(a1 + a2)
>>> m.hexdigest()
'03c06ca09faa26166f1096db02272b11'
>>> a1 + a2 == a1 + a2
True
>>> a1 + a2 == a3
True
>>> a3 == a4
True

Am I missing something?

mushi.f asked Apr 29 '17 15:04


1 Answer

What you are missing is that hash.update() doesn't replace the hashed data. You are continually updating the hash object, so you are getting the hash of the concatenated strings. From the hashlib.hash.update() documentation:

Update the hash object with the string arg. Repeated calls are equivalent to a single call with the concatenation of all the arguments: m.update(a); m.update(b) is equivalent to m.update(a+b).

Bold emphasis mine.

So you are not getting the hash of a single 'stackoverflow' string. You get the hash of 'stackoverflow' first, then of 'stackoverflowstackoverflow', then of 'stackoverflowstackoverflowstackoverflow', and so on; each call appends another 'stackoverflow', so the hashed data grows longer and longer. None of those longer strings is equal to the original short string, so their hashes are not expected to match its hash either.
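The accumulation described above can be checked with a short script. This is only a sketch: the helper names running_digests and fresh_digests are made up for illustration, and bytes literals are used so the code also runs on Python 3, where update() requires bytes rather than str.

```python
import hashlib

# Bytes literal: a plain str on Python 2, and the required type on Python 3.
PIECE = b"stackoverflow"


def running_digests(piece, times):
    """Feed `piece` into ONE md5 object `times` times (hypothetical helper).

    Records the hex digest after every update() call, which mirrors
    the interactive session in the question.
    """
    m = hashlib.md5()
    digests = []
    for _ in range(times):
        m.update(piece)                # appends to the data already hashed
        digests.append(m.hexdigest())  # hexdigest() does not reset the object
    return digests


def fresh_digests(piece, times):
    """Hash piece * n with a NEW md5 object for each n in 1..times."""
    return [hashlib.md5(piece * n).hexdigest() for n in range(1, times + 1)]


running = running_digests(PIECE, 6)
fresh = fresh_digests(PIECE, 6)

# Each running digest should equal the digest of the n-fold repetition.
for n, (r, f) in enumerate(zip(running, fresh), start=1):
    print("%d x piece: %s  match=%s" % (n, r, r == f))
```

Only the first digest corresponds to a single 'stackoverflow'; every later one belongs to a longer repetition of it.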

Create a new object for new strings, instead:

>>> import hashlib
>>> m = hashlib.md5()
>>> m.update('stack' + 'overflow')
>>> m.hexdigest()
'73868cb1848a216984dca1b6b0ee37bc'
>>> m = hashlib.md5()   # **new** hash object
>>> m.update('stackoverflow')
>>> m.hexdigest()
'73868cb1848a216984dca1b6b0ee37bc'
>>> m = hashlib.md5()     # new object again
>>> m.update('stack')     # add the string in pieces, part 1
>>> m.update('overflow')  # and part 2
>>> m.hexdigest()
'73868cb1848a216984dca1b6b0ee37bc'

You can readily reproduce your 'wrong' hashes by hashing the concatenated data with a fresh object each time:

>>> m = hashlib.md5()
>>> m.update('stackoverflowstackoverflow')
>>> m.hexdigest()
'458b7358b9e0c3f561957b96e543c5a8'
>>> m = hashlib.md5()
>>> m.update('stackoverflowstackoverflowstackoverflow')
>>> m.hexdigest()
'65b0e62d4ff2d91e111ecc8f27f0e8f5'
>>> m = hashlib.md5()
>>> m.update('stackoverflow' * 4)
>>> m.hexdigest()
'60c3ae3dd9a2095340b2e024194bad3c'

Note that you can also pass the initial string directly to the md5() constructor:

>>> hashlib.md5('stackoverflow').hexdigest()
'73868cb1848a216984dca1b6b0ee37bc'

You normally use the hash.update() method only if you are processing data in chunks (like reading a file line by line or reading blocks of data from a socket), and don't want to have to hold all of that data in memory at once.
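As a rough illustration of that chunked use case, here is a minimal sketch that hashes a file block by block. The helper name md5_of_file, the chunk sizes, and the temporary demo file are all assumptions made for this example; none of them come from hashlib itself.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=8192):
    """Hypothetical helper: md5 hex digest of a file, read in fixed-size blocks."""
    m = hashlib.md5()
    with open(path, "rb") as f:
        # iter(callable, sentinel) keeps calling f.read() until it returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            m.update(chunk)  # each call extends the data hashed so far
    return m.hexdigest()


# Demo data written to a temporary file (made up for this example).
data = b"stackoverflow" * 10000
fd, demo_path = tempfile.mkstemp()
try:
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    chunked = md5_of_file(demo_path, chunk_size=4096)
finally:
    os.remove(demo_path)

# Hashing everything at once should give the same digest as hashing in chunks.
whole = hashlib.md5(data).hexdigest()
print(chunked == whole)
```

Here the accumulating behavior of update() is exactly what you want: the digest covers the whole file, while only one block is held in memory at a time.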

Martijn Pieters answered Sep 20 '22 10:09