In Python, I would like to quickly compute an order-invariant hash for the lines of a file as a way to "uniquely" identify its content. These files are, for example, the output of a select ... from table query, and thus the order of the lines is random.
Here is an example that achieves what I want (using one of the hashers in hashlib), but at the expense of having to sort the lines. Note that sorting the lines is just a way to achieve the goal, i.e. to get a hash that doesn't depend on the ordering of the lines in the file. But clearly, I'd like to avoid the O(n*log(n)) cost, esp. when the files are much longer.
import hashlib
import os

def get_hexdigest(filename, hasher, blocksize=65536, order_invariant=False):
    if not os.path.isfile(filename):
        return None
    if order_invariant:
        with open(filename, 'r') as f:
            for line in sorted(f):
                hasher.update(line.encode())
    else:
        with open(filename, 'rb') as f:
            while True:
                buf = f.read(blocksize)
                hasher.update(buf)
                if len(buf) < blocksize:
                    break
    return hasher.hexdigest()
So, for e.g. a 1 MB, 50K-row file:
%%time
get_hexdigest('some_file', hashlib.sha1())
# Wall time: 1.71 ms
But:
%%time
get_hexdigest('some_file', hashlib.sha1(), order_invariant=True)
# Wall time: 77.4 ms
What is a better/faster way to do that?
As noted in this answer, Scala has an order-invariant hash based on Murmurhash, but I assume it is the 32-bit version of mmh3 (too collision-prone for my usage), and I would also rather use a standard library available in Python than implement something in C or in Cython. Murmurhash3 has a 128-bit version, but its output differs on x64 vs x86, and I would like machine-independent results.
So, in summary, I would like:

- a hash that is invariant to the ordering of the lines,
- computed in a single pass if possible, i.e. avoiding the O(n*log(n)) sort,
- with a low collision probability (ideally 128 bits or more),
- with machine-independent results,
- preferably using only Python's standard library.
Edits and notes:
Thanks to several comments, the code above has been updated to sort the lines in memory. The original version used when order_invariant is True was:
with os.popen('sort {}'.format(filename)) as f:
    for line in f:
        hasher.update(line.encode(encoding='utf-8'))
return hasher.hexdigest()
The associated wall time (for the file used above) was then 238 ms. This is now reduced to 77 ms, but still way slower than not sorting the lines. Sorting will add an n*log(n) cost for n lines.
The encoding (to UTF-8) and reading in mode 'r' rather than 'rb' is necessary when reading lines, as then we get strings, not bytes. I don't want to rely on assuming that the files contain only ASCII data; reading in 'rb' could lead to lines not being properly split. I don't have the same concern when order_invariant is False, because then I don't have to split the file into lines, and thus the fastest way is to slurp chunks of binary data to update the hasher.
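To make the idea concrete, something along these lines is what I have in mind (just a rough sketch; order_invariant_digest is a placeholder name, not settled code): hash each line separately and combine the per-line digests with a commutative operation such as addition modulo 2**160, so that a single pass over the file suffices and the result does not depend on line order.

import hashlib

def order_invariant_digest(filename):  # placeholder name, illustrative only
    total = 0
    mod = 1 << 160  # keep the full 160-bit width of a SHA-1 digest
    with open(filename, 'r') as f:
        for line in f:
            line_digest = hashlib.sha1(line.encode('utf-8')).digest()
            # addition is commutative, so line order does not affect the sum
            total = (total + int.from_bytes(line_digest, 'big')) % mod
    return format(total, '040x')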
I think you should sort the file beforehand (select ... from table order by ...) or come up with another solution for your actual problem.
Anyways, a possible approach in Python using a frozenset:
#!/usr/bin/python
lines1 = ['line1', 'line2', 'line3', 'line4']
lines2 = ['line2', 'line1', 'line3', 'line4'] # same as lines1 but different order
lines3 = ['line1', 'line1', 'line3', 'line4', 'line5']
for lines in [lines1, lines2, lines3]:
    print(lines)
    print(hash(frozenset(lines)))
    print('')
Output
['line1', 'line2', 'line3', 'line4']
8013284786872469720
['line2', 'line1', 'line3', 'line4']
8013284786872469720
['line1', 'line1', 'line3', 'line4', 'line5']
7430298023231386903
I doubt it will match your performance constraints. I don't know the time complexity (Big O) of frozenset(). It also assumes lines are unique. Again, I highly suggest tackling the underlying problem differently.
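For reference, a minimal sketch of applying this idea to a file's lines (frozenset_digest is just an illustrative name). Note that Python salts str hashes per process, so the value is only stable across runs if PYTHONHASHSEED is fixed, and duplicate lines collapse into a single set element.

def frozenset_digest(filename):
    # Only stable across runs if PYTHONHASHSEED is fixed, since str hashes
    # are salted per process; duplicate lines also collapse to one element.
    with open(filename, 'r') as f:
        return hash(frozenset(f))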
How about this Merkle-style map-reduce (hash the concatenation of the per-line hashes, with an optional sort after the map step to make the result order-invariant):
import hashlib

def hasher(data):
    # hash a single string (e.g. one line) and return its hex digest
    hasher = hashlib.sha1()
    hasher.update(data.encode('utf-8'))
    return hasher.hexdigest()

def get_digest_by_line(filename, line_invariant=False, hasher=hasher):
    with open(filename, 'r') as f:
        hashes = (hasher(line) for line in f)
        if line_invariant:
            # sorting the per-line hashes makes the final digest order-invariant
            hashes = sorted(hashes)
        return hasher(''.join(hashes))
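A usage sketch, reusing the 'some_file' name from the question's timings purely for illustration:

# order-sensitive digest (no sort of the per-line hashes)
print(get_digest_by_line('some_file'))

# order-invariant digest (per-line hashes are sorted before the final hash)
print(get_digest_by_line('some_file', line_invariant=True))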