I noticed some posts here on string matching, which reminded me of an old problem I'd like to solve. Does anyone have a good Levenshtein-like algorithm that is weighted toward Qwerty keyboards?
I want to compare two strings, and and allow for typos. Levenshtein is okay, but I'd prefer to also accept spelling errors based on the physical distance between keys on Qwerty keyboard. In other words, the algorithm should prefer "yelephone" to "zelephone" since the "y" key is located nearer to the "t" key than to the "z" key on most keyboards.
Any help would be great... this feature isn't central to my project, so I don't want to veer off into a rat-hole when I should be doing something more productive.
Because they use the qwerty pattern originally developed for mechanical type-writers. The logic of the qwerty layout was based on letter usage in English rather than letter postion in the alphabet.
The restricted Damerau Levenshtein Distance between two strings is commonly used for checking typographical errors in strings. It takes the deletion and insertion of a character, a wrong character (substition) or the swapping (transposition) of two characters into account.
Levenshtein distance is a lexical similarity measure which identifies the distance between one pair of strings. It does so by counting the number of times you would have to insert, delete or substitute a character from string 1 to make it like string 2.
In bioinformatics when you align two sequences of DNA you might have a model that has a different cost based on if the substitution is a transition or a transversion. This is exactly what you want but instead of a 4x4 matrix, you want a 40x40 matrix or some, dare I say distance function? So the cost of a replacement is from the matrix/function, not a constant.
CAVEAT: Be sure that deletions and insertions are weighted properly though, so they aren't over accepted as the minimum. You'll end up with a string of insertions/deletions/no-change-substitution characters.
The new function you are trying to minimize would be:
d[i, j] := minimum( d[i-1, j] + del_cost, d[i, j-1] + ins_cost, d[i-1, j-1] + keyboard_distance( s[i], t[j] ) )
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With