Levenshtein distance with non-uniform costs for insertions and substitutions

I have been trying to implement a Levenshtein distance function in C++ that gives different weights to substitutions and insertions based on which characters are being replaced or inserted.

The cost is calculated from the distance between the keys on a QWERTY keyboard. For example, in the standard edit distance algorithm, the distance between google, hoogle, and zoogle is the same: 1. What I want is different distances for these, something like google -> hoogle = 1, google -> zoogle = 4, hoogle -> zoogle = 5.

I followed the Wikipedia algorithm, using the full matrix for memoization, and implemented it in C++. Here is my function:

#include &lt;string&gt;
#include &lt;vector&gt;
#include &lt;algorithm&gt;
using namespace std;

// getKeyboardDist(char a, char b) gives the distance b/w the two keys
int getKeyboardDist(char a, char b);

int levDist(const string&amp; s, const string&amp; t) {

    const int deletionCost = 1;

    int m = s.length();
    int n = t.length();

    // d[i][j] = distance between the first i chars of s and the first j chars of t
    // (a vector avoids the non-standard variable-length array int d[m+1][n+1])
    vector&lt;vector&lt;int&gt;&gt; d(m + 1, vector&lt;int&gt;(n + 1));

    for (int i = 0; i &lt;= m; i++)
        d[i][0] = i;
    for (int j = 0; j &lt;= n; j++)
        d[0][j] = j;

    for (int j = 1; j &lt;= n; j++)
    {
        for (int i = 1; i &lt;= m; i++)
        {
            int keyDist = getKeyboardDist(s[i-1], t[j-1]);

            int substitutionCost = (s[i-1] == t[j-1]) ? 0 : keyDist;

            // this line is the one I think the problem lies in
            // (note: when i == 1 or j == 1, s[i-2] / t[j-2] read out of bounds)
            int insertionCost = (i &gt; j) ? getKeyboardDist(s[i-1], t[j-2])
                                        : getKeyboardDist(s[i-2], t[j-1]);

            insertionCost = insertionCost ? insertionCost : 1;

            d[i][j] = min(d[i-1][j]   + deletionCost,
                      min(d[i][j-1]   + insertionCost,
                          d[i-1][j-1] + substitutionCost));
        }
    }
    return d[m][n];
}

Now the substitutions work properly, I believe, but the problem is the insertions. I don't know which characters to take the distance between for an insertion, especially in the cases where the insertion is at the beginning or end of the string.

I would appreciate any help with this; let me know if there is any other information needed.

Thanks in advance.

asked by KaziJehangir


1 Answer

What you're trying to do makes sense for substitutions. You're hypothesizing that a person attempting to strike a key X is more likely to make a mistake by striking a key physically near X than one that's far away.

It doesn't make so much sense for insertions and deletions because the act of striking an extra key (insertion error) or skipping a key strike (deletion error) doesn't have anything obvious to do with key distances.

It's possible you're being misdirected by the two different meanings of "distance" in play here. Levenshtein distance is measured between strings in insert/substitute/delete operations. Keyboard distance is a physical separation. These are apples and oranges that happen to be described with the same word. They don't mix very well.

You are attempting to determine weights for the Levenshtein operations. Physical distances between keys make reasonable weights for substitution.
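
For concreteness, here is a minimal sketch of what a function like the asker's getKeyboardDist might look like, assuming a simple row/column grid for the letter keys and rounded Euclidean distance (the coordinates and the rounding are my assumptions, not anything given in the question):

#include &lt;cmath&gt;
#include &lt;map&gt;
#include &lt;utility&gt;

// Approximate row/column grid positions for the letter keys on a QWERTY
// keyboard (assumed coordinates; real keyboards stagger the rows).
int getKeyboardDist(char a, char b) {
    static const std::map&lt;char, std::pair&lt;int, int&gt;&gt; pos = {
        {'q',{0,0}}, {'w',{0,1}}, {'e',{0,2}}, {'r',{0,3}}, {'t',{0,4}},
        {'y',{0,5}}, {'u',{0,6}}, {'i',{0,7}}, {'o',{0,8}}, {'p',{0,9}},
        {'a',{1,0}}, {'s',{1,1}}, {'d',{1,2}}, {'f',{1,3}}, {'g',{1,4}},
        {'h',{1,5}}, {'j',{1,6}}, {'k',{1,7}}, {'l',{1,8}},
        {'z',{2,0}}, {'x',{2,1}}, {'c',{2,2}}, {'v',{2,3}}, {'b',{2,4}},
        {'n',{2,5}}, {'m',{2,6}},
    };
    auto pa = pos.find(a), pb = pos.find(b);
    if (pa == pos.end() || pb == pos.end())
        return 1;  // unknown key: fall back to the uniform cost
    int dr = pa-&gt;second.first  - pb-&gt;second.first;
    int dc = pa-&gt;second.second - pb-&gt;second.second;
    return static_cast&lt;int&gt;(std::round(std::sqrt(dr*dr + dc*dc)));
}

With these assumed coordinates the question's example values happen to work out: g to h is 1, g to z rounds to 4, and h to z rounds to 5.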

Weights for insert and delete - which involve only one character each - don't have any obvious relationship to physical separation.

What you'd really want is frequency data on which keys people actually insert and delete by mistake. You'd give the most common errors relatively low weights and the least common ones high weights.
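
Here's a sketch of how that could look, assuming you had such frequency data. The counts below are invented placeholders, and weight = -log(probability) is just one common way to turn frequencies into costs:

#include &lt;cmath&gt;
#include &lt;map&gt;

// Hypothetical counts of how often each key is inserted by mistake,
// out of an assumed total of 10000 observed insertion errors.
// Real numbers would have to come from a corpus of actual typos.
double insertWeight(char c) {
    static const std::map&lt;char, int&gt; insertErrors = {
        {'e', 900}, {'t', 700}, {'a', 650}, /* ... */ {'~', 3},
    };
    static const int total = 10000;

    auto it = insertErrors.find(c);
    int count = (it != insertErrors.end()) ? it-&gt;second : 1;  // smooth unseen keys
    return -std::log(static_cast&lt;double&gt;(count) / total);     // common -&gt; cheap
}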

@user6952491's idea that repeating the previous key might be a high frequency insert error has merit, but it's hard to extend this to a complete weighting scheme.

If you're in a guessing mood, you could hypothesize that it's easier to mistakenly insert a key near the middle of the keyboard than at the edges. Say f and j get the lowest weights, and characters like ~ that are shifted and at the keyboard's extremes get high weights, because it's unlikely you'd make the physical motions to type them without thinking.
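
A sketch combining that center-of-keyboard guess with @user6952491's repeated-key idea, reusing getKeyboardDist from the earlier sketch (every number here is a guess, not measured data):

#include &lt;algorithm&gt;

int getKeyboardDist(char a, char b);  // from the earlier sketch

// Guessed cost of mistakenly inserting 'inserted' right after 'previous'.
int guessedInsertCost(char inserted, char previous) {
    // Doubling the previous keystroke is assumed to be a common slip.
    if (inserted == previous)
        return 1;
    // Distance from the middle of the home row (between g and h) as a
    // proxy for how deliberate the keystroke must have been.
    int fromCenter = std::min(getKeyboardDist(inserted, 'g'),
                              getKeyboardDist(inserted, 'h'));
    return 1 + fromCenter;
}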

I'll leave it to you to formulate a similar guess about deletions.

For general typing, my guess is that keyboarding errors will have at least as much to do with spelling misapprehensions as physical mistakes. I.e., people will type "recieve" because they forgot the rule "i before e except after c," not due to the keyboard position of i relative to e.

Other kinds of typing, e.g. computer code, might well have entirely different patterns for making mistakes. Forgotten semicolons come to mind! Those would have very low weights!

Consequently, I'm virtually certain that the suggestions provided by modern spell checkers are rooted in machine learning algorithms that draw conclusions from mistakes that many thousands of people have made in the past on similar tasks, not simple metrics based on keyboard distance.

answered by Gene


