Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Improved Levenshtein Algorithm

I recently implemented the levenshtein algorithm into our search engine database, but we have come across a problem.

According to the basic levenshtein

Levenshtein('123456','12x456') is the same value as Levenshtein('123456', '12345x')

Normally this is fine, but for my specific problem that is incorrect. When someone uses our website, this is incorrect. Manufacturers of electronic components often make similar products with only a difference in the very last letter. If the first letter is different, it's usually an entirely different category. Therefore I need an algorithm that considers matches near the beginning of the word more valuable than those in the back or in other words, mismatches that occur near the beginning should apply a larger penalty than those in the back.

If anyone has any idea please let me know.

like image 270
Mike D Avatar asked Oct 20 '11 20:10

Mike D


3 Answers

Use the Jaro-Winkler Distance... It's exactly what you are asking for.

like image 92
Will Castillo Avatar answered Nov 10 '22 10:11

Will Castillo


See the Smith-Waterman algorithm widely used in bioinformatics. It can perform a local alignment of your query, but that will be slower that Levenshtein.

like image 37
Pierre Avatar answered Nov 10 '22 09:11

Pierre


We had a similar issue and used the Brew edit distance method

We were in Perl so we used the Text::Brew library. My co-worker did a nice presentation on using a few different algorithms, including Brew.

like image 1
Rob Di Marco Avatar answered Nov 10 '22 09:11

Rob Di Marco