I recently implemented the levenshtein algorithm into our search engine database, but we have come across a problem.
According to the basic levenshtein
Levenshtein('123456','12x456') is the same value as Levenshtein('123456', '12345x')
Normally this is fine, but for my specific problem that is incorrect. When someone uses our website, this is incorrect. Manufacturers of electronic components often make similar products with only a difference in the very last letter. If the first letter is different, it's usually an entirely different category. Therefore I need an algorithm that considers matches near the beginning of the word more valuable than those in the back or in other words, mismatches that occur near the beginning should apply a larger penalty than those in the back.
If anyone has any idea please let me know.
Use the Jaro-Winkler Distance... It's exactly what you are asking for.
See the Smith-Waterman algorithm widely used in bioinformatics. It can perform a local alignment of your query, but that will be slower that Levenshtein.
We had a similar issue and used the Brew edit distance method
We were in Perl so we used the Text::Brew library. My co-worker did a nice presentation on using a few different algorithms, including Brew.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With