I need to compare a large number of strings similar to 50358c591cef4d76. I have a Hamming distance function (using pHash) I can use. How do I do this efficiently? My pseudocode would be:
For each string
currentstring= string
For each string other than currentstring
Calculate Hamming distance
I'd like to output the results as a matrix and be able to retrieve values. I'd also like to run it via Hadoop Streaming!
Any pointers are gratefully received.
Here is what i have tried but it is slow:
import glob
path = lotsdir + '*.*'
files = glob.glob(path)
files.sort()
setOfFiles = set(files)
print len(setOfFiles)
i=0
j=0
for fname in files:
print 'fname',fname, 'setOfFiles', len(setOfFiles)
oneLessSetOfFiles=setOfFiles
oneLessSetOfFiles.remove(fname)
i+=1
for compareFile in oneLessSetOfFiles:
j+=1
hash1 = pHash.imagehash( fname )
hash2 = pHash.imagehash( compareFile)
print ...
The distance package in python provides a hamming distance calculator:
import distance
distance.levenshtein("lenvestein", "levenshtein")
distance.hamming("hamming", "hamning")
There is also a levenshtein package which provides levenshtein distance calculations. Finally difflib can provide some simple string comparisons.
There is more information and example code for all of these on this old question.
Your existing code is slow because you recalculate the file hash in the most inner loop, which means every file gets hashed many times. If you calculate the hash first then the process will be much more efficient:
files = ...
files_and_hashes = [(f, pHash.imagehash(f)) for f in files]
file_comparisons = [
(hamming(first[0], second[0]), first, second)
for second in files
for first in files
if first[1] != second[1]
]
This process fundamentally involves O(N^2)
comparisons so to distribute this in a way suitable for a map reduce problem involves taking the complete set of strings and dividing them into B
blocks where B^2 = M
(B = number of string blocks, M = number of workers). So if you had 16 strings and 4 workers you would split the list of strings into two blocks (so a block size of 8). An example of dividing the work follows:
all_strings = [...]
first_8 = all_strings[:8]
last_8 = all_strings[8:]
compare_all(machine_1, first_8, first_8)
compare_all(machine_2, first_8, last_8)
compare_all(machine_3, last_8, first_8)
compare_all(machine_4, last_8, last_8)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With