 

Order-invariant hash in Python

Tags: python, hash

In Python, I would like to quickly compute an order-invariant hash for the lines of a file, as a way to "uniquely" identify its content. These files are, for example, the output of a select ... from table, and thus the order of the lines is random.

Here is an example that achieves what I want (using one of the hashers in hashlib), but at the expense of having to sort the lines. Note that sorting the lines is just a way to achieve the goal, i.e., a hash that doesn't depend on the ordering of the lines in the file. But clearly, I'd like to avoid the O(n log n) cost, especially when the files are much longer.

import hashlib
import os

def get_hexdigest(filename, hasher, blocksize=65536, order_invariant=False):
    if not os.path.isfile(filename):
        return None
    if order_invariant:
        # Hash line by line, in sorted order, so the digest does not depend
        # on the order of the lines in the file.
        with open(filename, 'r') as f:
            for line in sorted(f):
                hasher.update(line.encode())
    else:
        # Hash the raw bytes in fixed-size chunks.
        with open(filename, 'rb') as f:
            while True:
                buf = f.read(blocksize)
                hasher.update(buf)
                if len(buf) < blocksize:
                    break
    return hasher.hexdigest()

So, for e.g. a 1 MB, 50K-row file:

%%time
get_hexdigest('some_file', hashlib.sha1())
# Wall time: 1.71 ms

But:

%%time
get_hexdigest('some_file', hashlib.sha1(), order_invariant=True)
# Wall time: 77.4 ms

What is a better/faster way to do that?

As noted in this answer, Scala has an order-invariant hash based on MurmurHash, but I assume it is the 32-bit version of mmh3 (too collision-prone for my usage), and I would also rather use a library readily available in Python than implement something in C or in Cython. MurmurHash3 has a 128-bit version, but its output is different on x64 vs x86. I would like to have machine-independent results.
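
For illustration, a minimal sketch of that architecture concern, assuming the mmh3 package's hash128 function and its x64arch flag (I believe this is the API, but treat the exact signature as an assumption):

import mmh3

data = 'some line\n'

# MurmurHash3 x64_128 and x86_128 are distinct algorithm variants, so they
# give different results for the same input; the x64arch flag (assumed API)
# selects the variant explicitly rather than following the host architecture.
h_x64 = mmh3.hash128(data, x64arch=True)
h_x86 = mmh3.hash128(data, x64arch=False)
print(h_x64 != h_x86)  # expected: True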

So, in summary, I would like:

  • consistent results across machine architectures
  • low collision rate, i.e. at least 128 bits with good dispersion (but I don't need the hash to be cryptographic)
  • reasonably fast, i.e. under 5 ms for a 1 MB, 50K-line file
  • readily available, if possible, as a library on PyPI or Conda
  • amenable to files with repeated lines (so just XORing per-line hashes is a non-starter, as any pair of identical lines would cancel each other; see the sketch right after this list).
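
For illustration, a minimal sketch of the cancellation problem mentioned in the last bullet (SHA-1 here just stands in for any per-line hash):

import hashlib

def xor_of_line_hashes(lines):
    # Combine per-line hashes with XOR (a commutative, order-invariant combiner).
    acc = 0
    for line in lines:
        acc ^= int.from_bytes(hashlib.sha1(line.encode('utf-8')).digest(), 'big')
    return acc

# The two repeated 'c' lines cancel each other out, so the combined hash
# cannot tell these two "files" apart.
print(xor_of_line_hashes(['a', 'b']) == xor_of_line_hashes(['a', 'b', 'c', 'c']))  # True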

Edits and notes: Thanks to several comments, the code above has been updated to sort the lines in memory. The original version, when order_invariant is True, was:

    with os.popen('sort {}'.format(filename)) as f:
        for line in f:
            hasher.update(line.encode(encoding='utf-8'))
    return hasher.hexdigest()

The associated wall time (for the file used above) was then 238 ms. This is now reduced to 77 ms, but still way slower than not sorting the lines. Sorting adds an O(n log n) cost for n lines.

The encoding (to UTF-8) and reading in mode 'r' rather than 'rb' are necessary when reading lines, as we then get strings, not bytes. I don't want to rely on the assumption that the files contain only ASCII data, and reading in 'rb' could lead to lines not being split properly. I don't have the same concern when order_invariant is False, because then I don't have to split the file, and so the fastest way is to slurp chunks of binary data to update the hasher.

asked Feb 17 '17 by Pierre D

2 Answers

I think you should sort the data beforehand (select ... from table order by ...) or come up with another solution for your actual problem.

Anyway, a possible approach in Python using a frozenset:

#!/usr/bin/python

lines1 = ['line1', 'line2', 'line3', 'line4']
lines2 = ['line2', 'line1', 'line3', 'line4']  # same as lines1 but different order
lines3 = ['line1', 'line1', 'line3', 'line4', 'line5']  # contains a repeated line and an extra line


for lines in [lines1, lines2, lines3]:
    print(lines)
    print(hash(frozenset(lines)))
    print('')

Output

['line1', 'line2', 'line3', 'line4']
8013284786872469720

['line2', 'line1', 'line3', 'line4']
8013284786872469720

['line1', 'line1', 'line3', 'line4', 'line5']
7430298023231386903

I doubt it will match your performance constraints. I don't know the time complexity (big O) of frozenset(). It also assumes the lines are unique. Again, I highly suggest tackling the underlying problem differently.
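
To make that caveat concrete, here is a small sketch: frozenset drops duplicates, and note also that Python's built-in str hash is salted per process, which matters for the cross-machine requirement:

lines_a = ['line1', 'line1', 'line2']
lines_b = ['line1', 'line2', 'line2']

# frozenset collapses duplicates, so two inputs that differ only in how many
# times a line repeats end up as the same frozenset, hence the same hash.
print(frozenset(lines_a) == frozenset(lines_b))              # True
print(hash(frozenset(lines_a)) == hash(frozenset(lines_b)))  # True

# Also note: since Python 3.3, str hashing is randomized per process unless
# PYTHONHASHSEED is fixed, so hash() of a frozenset of strings is not stable
# across runs or machines.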

answered Nov 19 '22 by Rolf


How about this Merkle-style map-reduce: hash each line (the map step), optionally sort the per-line hashes to get order invariance, then hash their concatenation (the reduce step):

import hashlib

def hasher(data):
    # Hash a single string with SHA-1 and return its hex digest.
    h = hashlib.sha1()
    h.update(data.encode('utf-8'))
    return h.hexdigest()


def get_digest_by_line(filename, line_invariant=False, hasher=hasher):
    with open(filename, 'r') as f:
        # Map: hash each line individually.
        hashes = (hasher(line) for line in f)
        if line_invariant:
            # Sorting the per-line hashes makes the result order-invariant.
            hashes = sorted(hashes)
        # Reduce: hash the concatenation of the per-line hashes.
        return hasher(''.join(hashes))
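
A possible usage sketch (the file names here are hypothetical): two dumps containing the same lines in a different order should produce the same digest when line_invariant=True.

d1 = get_digest_by_line('dump_run1.txt', line_invariant=True)
d2 = get_digest_by_line('dump_run2.txt', line_invariant=True)
print(d1 == d2)  # expected: True if the files contain the same lines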

answered Nov 19 '22 by Sean Summers