Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

A good algorithm similar to Levenshtein but weighted for Qwerty keyboards?

Tags:

I noticed some posts here on string matching, which reminded me of an old problem I'd like to solve. Does anyone have a good Levenshtein-like algorithm that is weighted toward Qwerty keyboards?

I want to compare two strings, and and allow for typos. Levenshtein is okay, but I'd prefer to also accept spelling errors based on the physical distance between keys on Qwerty keyboard. In other words, the algorithm should prefer "yelephone" to "zelephone" since the "y" key is located nearer to the "t" key than to the "z" key on most keyboards.

Any help would be great... this feature isn't central to my project, so I don't want to veer off into a rat-hole when I should be doing something more productive.

like image 478
username Avatar asked Sep 08 '08 16:09

username


People also ask

Why do we use qwerty instead of Abcde?

Because they use the qwerty pattern originally developed for mechanical type-writers. The logic of the qwerty layout was based on letter usage in English rather than letter postion in the alphabet.

What is damerau levenshtein distance used for?

The restricted Damerau Levenshtein Distance between two strings is commonly used for checking typographical errors in strings. It takes the deletion and insertion of a character, a wrong character (substition) or the swapping (transposition) of two characters into account.

How does Python calculate levenshtein distance?

Levenshtein distance is a lexical similarity measure which identifies the distance between one pair of strings. It does so by counting the number of times you would have to insert, delete or substitute a character from string 1 to make it like string 2.


1 Answers

In bioinformatics when you align two sequences of DNA you might have a model that has a different cost based on if the substitution is a transition or a transversion. This is exactly what you want but instead of a 4x4 matrix, you want a 40x40 matrix or some, dare I say distance function? So the cost of a replacement is from the matrix/function, not a constant.

CAVEAT: Be sure that deletions and insertions are weighted properly though, so they aren't over accepted as the minimum. You'll end up with a string of insertions/deletions/no-change-substitution characters.

The new function you are trying to minimize would be:

d[i, j] := minimum(     d[i-1, j] + del_cost,     d[i, j-1] + ins_cost,     d[i-1, j-1] + keyboard_distance( s[i], t[j] ) ) 
like image 132
nlucaroni Avatar answered Oct 13 '22 19:10

nlucaroni