Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Locality-sensitive hashing of strings?

Is there a hash function for strings, such that strings within a small edit distance (for example, misspellings) would map to the same, or very close, hash values, while dissimilar strings would tend not to?

like image 275
MWB Avatar asked Oct 29 '25 21:10

MWB


1 Answers

One option is to calculate set of all k-mers (substrings of length k), hash them and calculate the minimum. So you are combining idea of shingles, with idea of minhashing. (repeat multiple times to get better results, as usual with LSH schemes)

The way how this works is that probability of two string having same minhash is same as Jackard similarity of their k-mer sets. Similarity of k-mer sets is related to edit distance (but not the same).

like image 86
usamec Avatar answered Oct 31 '25 12:10

usamec