I'm playing around with Levenshteins Edit Distance algorithm, and I want to extend this to count transpositions -- that is, exchanges of adjacent letters -- as 1 edit. The unmodified algorithm counts insertions, deletes or substitutions needed to reach a certain string from another. For instance, the edit distance from "KITTEN" to "SITTING" is 3. Here's the explanation from Wikipedia:
Following the same method, the edit distance from "CHIAR" to "CHAIR" is 2:
I would like to count this as "1 edit", since I only exchange two adjacent letters. How would I go about to do this?
For this computation, we simply have to do - (1 + array[m-1][n]) where 1 is the cost of delete operation and array[m-1][n] is edit distance between 'm-1' characters of str1 and 'n' characters of str2. 2. Similarly, for the second case of inserting last character of str2 into str1, we have to do - (1 + array[m][n-1]).
The edit-distance is the score of the best possible alignment between the two genetic sequences over all possible alignments. In this example, the second alignment is in fact optimal, so the edit-distance between the two strings is 7.
While the original motivation was to measure distance between human misspellings to improve applications such as spell checkers, Damerau–Levenshtein distance has also seen uses in biology to measure the variation between protein sequences.
The Levenshtein distance is usually calculated by preparing a matrix of size (M+1)x(N+1) —where M and N are the lengths of the 2 words—and looping through said matrix using 2 for loops, performing some calculations within each iteration.
You need one more case in the algorithm from Wikipedia:
if s[i] = t[j] then
d[i, j] := d[i-1, j-1]
else if i > 0 and j > 0 and s[i] = t[j - 1] and s[i - 1] = t[j] then
d[i, j] := minimum
(
d[i-2, j-2] + 1 // transpose
d[i-1, j] + 1, // deletion
d[i, j-1] + 1, // insertion
d[i-1, j-1] + 1 // substitution
)
else
d[i, j] := minimum
(
d[i-1, j] + 1, // deletion
d[i, j-1] + 1, // insertion
d[i-1, j-1] + 1 // substitution
)
You have to modify how you update the dynamic programming table. In the original algorithm one considers the tails(or heads) of the two words that differ at the most by length one. The update is the minimum of all such possibilities.
If you want to modify the algorithm such that changes in two adjacent locations count as one, the minimum above has to be computed over tails(or heads) that differ by at most two. You can extend this to larger neighborhoods but the complexity will increase exponentially in the size of that neighborhood.
You can generalize further and assign costs that depend on the character(s) deleted, inserted or substituted, but you have to make sure that the cost you assign to a pair-edit is lower than two single edits, otherwise the two single edits will always win.
Let the words be w1 and w2
dist(i,j) = min(
dist(i-2,j-2) && w1(i-1,i) == w2(j-1,j) else
dist(i-1,j-1) && w1(i) == w2(j) else
dist(i,j-1) + cost(w2(j)),
dist(i-1,j) + cost(w1(i)),
dist(i-1,j-1) + cost(w1(i), w2(j)),
dist(i, j-2) + cost(w2(j-1,j)),
dist(i-2, j) + cost(w1(i-1,i)),
dist(i-2,j-2) + cost(w1(i-1,i), w2(j-1,j))
)
What I mean by the &&
is that those lines should be considered only if the conditions are satisfied.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With