I have a database of 350,000 strings with an average length of about 500 characters. The strings are not made up of words; they are an essentially random assortment of characters.
I need to make sure no two of the strings are too similar, where similarity is defined as edit distance divided by the average length of the two strings. The division is because smaller edit distances are more acceptable for smaller strings. It is fine if a different metric is used for performance reasons, but edit distance is the preferred baseline metric.
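As a rough sketch of that baseline metric (the edit_distance and similarity functions below are just an illustration of what I mean, not my actual code):

def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance, O(len(a) * len(b)).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

def similarity(a, b):
    # Edit distance normalized by the average length of the two strings.
    avg_len = (len(a) + len(b)) / 2
    return edit_distance(a, b) / avg_len if avg_len else 0.0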
Naively, we calculate edit distance with runtime O(a*b), where a and b are the lengths of the two strings. We do this for all n^2 pairs, which gives an overall runtime of O(n^2*a*b), clearly too large with n = 350,000 and a, b = 500.
The database is in the form of a Python list read from a CSV file. I'd like to process it in a Pythonic way, if possible.
How can this be sped up? I'm not sure how long the naive algorithm will take to finish (on the order of weeks), but ideally it should take less than a day to run.
I wrote a very brief prototype of a simple locality-sensitive hashing algorithm in Python. However, there are a few caveats, and you may want to optimize some pieces as well; I'll mention them as they come up.
Assume all your strings are stored in strings.
import random
from collections import Counter

MAX_LENGTH = 500
SAMPLING_LENGTH = 10

def bit_sampling(string, indices):
    # Sample the characters at fixed positions; pad with ' ' past the end of shorter strings.
    return ''.join([string[i] if i < len(string) else ' ' for i in indices])

# Use the same random positions for every string.
indices = random.sample(range(MAX_LENGTH), SAMPLING_LENGTH)
hashes = [bit_sampling(string, indices) for string in strings]

counter = Counter(hashes)
most_common, count = counter.most_common()[0]
while count > 1:
    dup_indices = [i for i, x in enumerate(hashes) if x == most_common]
    # You can use dup_indices to check the edit distance for the original strings here
    # (see the sketch below).
    counter.pop(most_common)
    most_common, count = counter.most_common()[0]
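As a sketch of that verification step (similar_pairs_in_bucket and THRESHOLD are my own illustrative names, and edit_distance stands for whatever baseline implementation you use):

from itertools import combinations

THRESHOLD = 0.2  # example normalized-distance threshold; choose your own

def similar_pairs_in_bucket(dup_indices, strings, threshold=THRESHOLD):
    # Compare only the strings that landed in the same hash bucket.
    pairs = []
    for i, j in combinations(dup_indices, 2):
        a, b = strings[i], strings[j]
        avg_len = (len(a) + len(b)) / 2
        if edit_distance(a, b) / avg_len <= threshold:
            pairs.append((i, j))
    return pairs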
First of all, this is a slight variant of bit sampling that works best for the general Hamming distance. Ideally, if all your strings are of the same length, this gives a theoretical probability bound for the Hamming distance: when the Hamming distance between two strings is small, it is very unlikely that they will have different hashes. This is controlled by the parameter SAMPLING_LENGTH. A larger SAMPLING_LENGTH makes it more likely that similar strings hash to different values, but also reduces the probability that dissimilar strings hash to the same value. For Hamming distance, you can calculate this trade-off easily.
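As a sketch of that calculation (my own illustration): for equal-length strings of length L at Hamming distance d, each sampled position agrees with probability roughly (L - d) / L, so two strings collide with probability about ((L - d) / L) ** SAMPLING_LENGTH.

def collision_probability(hamming_distance, length=MAX_LENGTH, sampling_length=SAMPLING_LENGTH):
    # Approximate: sampling positions without replacement makes this slightly off.
    return ((length - hamming_distance) / length) ** sampling_length

print(collision_probability(10))   # ~0.82: very close strings almost always collide
print(collision_probability(250))  # ~0.001: dissimilar strings rarely collide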
Running this snippet multiple times increases your confidence that no similar strings remain, since each run samples different positions.
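A sketch of what repeated rounds might look like (NUM_ROUNDS is an arbitrary choice of mine):

NUM_ROUNDS = 20
for _ in range(NUM_ROUNDS):
    # Fresh random positions each round, so different parts of the strings are compared.
    indices = random.sample(range(MAX_LENGTH), SAMPLING_LENGTH)
    hashes = [bit_sampling(string, indices) for string in strings]
    for h, count in Counter(hashes).items():
        if count > 1:
            dup_indices = [i for i, x in enumerate(hashes) if x == h]
            # Verify each bucket with the real edit distance, as sketched above.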
To accommodate your need to compare strings of different lengths, one possible approach is to left-pad the shorter strings with spaces and make padded copies of them.
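One reading of that idea, as a sketch (left_pad is my own illustrative helper):

def left_pad(string, length=MAX_LENGTH):
    # Right-justify so the ends of the strings line up across different lengths;
    # bit_sampling already right-pads implicitly by returning ' ' past the end.
    return string.rjust(length)

padded_hashes = [bit_sampling(left_pad(s), indices) for s in strings]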
Though all of the operations in this snippet are linear, O(n), they may still consume significant memory and running time, and it might be possible to shave off a constant factor.
You might also want to consider using a more sophisticated locality-sensitive hashing algorithm, such as those surveyed here: https://arxiv.org/pdf/1408.2927.pdf