
Quickly check large database for edit-distance similarity

I have a database of 350,000 strings with an average length of about 500. The strings are not made up of words; they are an essentially random assortment of characters.

I need to make sure no two of the strings are too similar, where similarity is defined as the edit distance divided by the average length of the two strings. The division is because smaller edit distances are more acceptable for shorter strings. It is fine if a different metric is used for performance reasons, but edit distance is the preferred baseline metric.

Naively, edit distance is calculated in O(a*b) time, where a and b are the lengths of the two strings. Doing this for all n^2 pairs gives an overall runtime of O(n^2 * a * b), which is clearly too large with n = 350,000 and a, b ≈ 500.
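
For reference, here is a minimal sketch of that baseline (a plain dynamic-programming Levenshtein distance plus the normalized metric described above; this is the slow approach being asked about, not a solution):

def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance, O(len(a) * len(b)).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def similarity(a, b):
    # Edit distance normalized by the average length of the two strings.
    return edit_distance(a, b) / ((len(a) + len(b)) / 2)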

The database is in the form of a Python list read from a csv file. I'd like to process it in a Pythonic way, if possible.

How can this be sped up? I'm not sure exactly how long the naive algorithm would take to finish (likely on the order of weeks), but ideally it should take less than a day to run.

asked Feb 16 '18 by Evan Weissburg

1 Answer

I wrote a very brief prototype of a simple locality-sensitive hashing algorithm in Python. There are a few caveats, and some pieces could be optimized further; I'll mention them as they come up.

Assume all of your strings are stored in a list named strings.

import random
from collections import Counter

MAX_LENGTH = 500
SAMPLING_LENGTH = 10

def bit_sampling(string, indices):
    # Sample the characters at the chosen positions; pad with a space
    # when the string is shorter than a sampled index.
    return ''.join([string[i] if i < len(string) else ' ' for i in indices])

# Pick SAMPLING_LENGTH random positions once and hash every string by
# the characters found at those positions.
indices = random.sample(range(MAX_LENGTH), SAMPLING_LENGTH)
hashes = [bit_sampling(string, indices) for string in strings]

# Any hash shared by more than one string marks a group of candidate
# near-duplicates that deserves an exact check.
counter = Counter(hashes)
most_common, count = counter.most_common(1)[0]
while count > 1:
    dup_indices = [i for i, x in enumerate(hashes) if x == most_common]
    # You can use dup_indices to check the edit distance for the original strings here.
    counter.pop(most_common)
    if not counter:
        break
    most_common, count = counter.most_common(1)[0]

First of all, this is a slight variant of bit sampling that works best for Hamming distance. Ideally, if all your strings are of the same length, it gives a theoretical probability bound on the Hamming distance: when the Hamming distance between two strings is small, it is very unlikely that they will receive different hashes. The trade-off is controlled by the parameter SAMPLING_LENGTH. A larger SAMPLING_LENGTH makes it more likely that similar strings hash to different values, but also reduces the probability that dissimilar strings hash to the same value. For Hamming distance, you can calculate this trade-off exactly.
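
As a rough illustration of that trade-off (treating the sampled positions as independent, which random.sample only approximates since it samples without replacement): two equal-length strings at Hamming distance d collide exactly when none of the sampled positions hits a differing character, so the collision probability is roughly ((L - d) / L) ** SAMPLING_LENGTH.

def collision_probability(length, hamming_distance, sampling_length=SAMPLING_LENGTH):
    # Approximate probability that two strings of the given length,
    # differing in hamming_distance positions, receive the same hash.
    return ((length - hamming_distance) / length) ** sampling_length

print(collision_probability(500, 5))    # very similar strings: about 0.90
print(collision_probability(500, 250))  # dissimilar strings: about 0.001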

Running this snippet multiple times increases your confidence that no similar strings exist, since each run samples different positions.
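
For example, a minimal sketch of repeating the sampling, building on the snippet above (NUM_ROUNDS and candidate_groups are illustrative names, not part of the original snippet):

NUM_ROUNDS = 20  # illustrative; more rounds give higher confidence

def candidate_groups(strings, sampling_length=SAMPLING_LENGTH):
    # One round of bit sampling: bucket string indices by their sampled
    # hash and keep only the buckets containing more than one string.
    indices = random.sample(range(MAX_LENGTH), sampling_length)
    buckets = {}
    for i, string in enumerate(strings):
        buckets.setdefault(bit_sampling(string, indices), []).append(i)
    return [group for group in buckets.values() if len(group) > 1]

candidate_pairs = set()
for _ in range(NUM_ROUNDS):
    for group in candidate_groups(strings):
        candidate_pairs.update((a, b) for a in group for b in group if a < b)

# Only the pairs in candidate_pairs need the exact edit-distance check.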

To accommodate comparing strings of different lengths, one possible approach is to left-pad the shorter strings with spaces and hash the padded copies as well.
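
A hypothetical sketch of that padding idea, reusing bit_sampling and indices from the snippet above (padded_copies is an illustrative name):

def padded_copies(string, max_length=MAX_LENGTH):
    # Keep the original (which bit_sampling implicitly right-pads) plus
    # a left-padded copy, so two similar strings of different lengths
    # can collide on at least one alignment.
    return [string, string.rjust(max_length)]

padded_hashes = [bit_sampling(copy, indices)
                 for string in strings
                 for copy in padded_copies(string)]

Note that with two copies per string, an index i in padded_hashes maps back to the original string i // 2.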

Though all of the operations in this snippet are linear (O(n)), they may still consume significant memory and running time, and it might be possible to shave off a constant factor.

You might also want to consider using a more sophisticated locality-sensitive hashing algorithm, such as those surveyed here: https://arxiv.org/pdf/1408.2927.pdf

answered Nov 07 '22 by Haochen Wu