Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to modify Levenshteins Edit Distance to count "adjacent letter exchanges" as 1 edit

I'm playing around with Levenshteins Edit Distance algorithm, and I want to extend this to count transpositions -- that is, exchanges of adjacent letters -- as 1 edit. The unmodified algorithm counts insertions, deletes or substitutions needed to reach a certain string from another. For instance, the edit distance from "KITTEN" to "SITTING" is 3. Here's the explanation from Wikipedia:

  1. kitten → sitten (substitution of 'k' with 's')
  2. sitten → sittin (substitution of 'e' with 'i')
  3. sittin → sitting (insert 'g' at the end).

Following the same method, the edit distance from "CHIAR" to "CHAIR" is 2:

  1. CHIAR → CHAR (delete 'I')
  2. CHAR → CHAIR (insert 'I')

I would like to count this as "1 edit", since I only exchange two adjacent letters. How would I go about to do this?

like image 299
Svein Bringsli Avatar asked Oct 29 '10 20:10

Svein Bringsli


People also ask

What is the algorithm to find the edit distance between two words?

For this computation, we simply have to do - (1 + array[m-1][n]) where 1 is the cost of delete operation and array[m-1][n] is edit distance between 'm-1' characters of str1 and 'n' characters of str2. 2. Similarly, for the second case of inserting last character of str2 into str1, we have to do - (1 + array[m][n-1]).

What is edit distance in sequence alignment?

The edit-distance is the score of the best possible alignment between the two genetic sequences over all possible alignments. In this example, the second alignment is in fact optimal, so the edit-distance between the two strings is 7.

What is the main use of the damerau Levenshtein distance?

While the original motivation was to measure distance between human misspellings to improve applications such as spell checkers, Damerau–Levenshtein distance has also seen uses in biology to measure the variation between protein sequences.

How is levenshtein calculated?

The Levenshtein distance is usually calculated by preparing a matrix of size (M+1)x(N+1) —where M and N are the lengths of the 2 words—and looping through said matrix using 2 for loops, performing some calculations within each iteration.


2 Answers

You need one more case in the algorithm from Wikipedia:

if s[i] = t[j] then 
  d[i, j] := d[i-1, j-1]
else if i > 0 and j > 0 and s[i] = t[j - 1] and s[i - 1] = t[j] then
  d[i, j] := minimum
             (
               d[i-2, j-2] + 1 // transpose
               d[i-1, j] + 1,  // deletion
               d[i, j-1] + 1,  // insertion
               d[i-1, j-1] + 1 // substitution
             )
else
  d[i, j] := minimum
             (
               d[i-1, j] + 1,  // deletion
               d[i, j-1] + 1,  // insertion
               d[i-1, j-1] + 1 // substitution
             )
like image 153
Mark Byers Avatar answered Nov 16 '22 03:11

Mark Byers


You have to modify how you update the dynamic programming table. In the original algorithm one considers the tails(or heads) of the two words that differ at the most by length one. The update is the minimum of all such possibilities.

If you want to modify the algorithm such that changes in two adjacent locations count as one, the minimum above has to be computed over tails(or heads) that differ by at most two. You can extend this to larger neighborhoods but the complexity will increase exponentially in the size of that neighborhood.

You can generalize further and assign costs that depend on the character(s) deleted, inserted or substituted, but you have to make sure that the cost you assign to a pair-edit is lower than two single edits, otherwise the two single edits will always win.

Let the words be w1 and w2

dist(i,j) = min(
                dist(i-2,j-2) && w1(i-1,i) == w2(j-1,j) else
                dist(i-1,j-1) && w1(i) == w2(j) else
                dist(i,j-1)   + cost(w2(j)),
                dist(i-1,j)   + cost(w1(i)),
                dist(i-1,j-1) + cost(w1(i), w2(j)),
                dist(i, j-2)  + cost(w2(j-1,j)),
                dist(i-2, j)  + cost(w1(i-1,i)),
                dist(i-2,j-2) + cost(w1(i-1,i), w2(j-1,j))
                ) 

What I mean by the && is that those lines should be considered only if the conditions are satisfied.

like image 40
srean Avatar answered Nov 16 '22 03:11

srean