Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to know the operations made to calculate the Levenshtein distance between strings?

With the function stringdist, I can calculate the Levenshtein distance between strings : it counts the number of deletions, insertions and substitutions necessary to turn a string into another. For instance, stringdist("abc abc","abcd abc") = 1 because "d" was inserted in the second string.

Is it possible to know the operations made to obtain the Levenshtein distance between two strings ? Or else to know the characters that are different between the 2 strings (in this example, only "d")? Thanks.

library(stringdist)
stringdist("abc abc","abcde acc") = 3

I would like to know that :

  • "d" was inserted

  • "e" was inserted

  • "b" was substitued into "c"

Or more simply, I would like to have the list ("d","e","c").

like image 327
yaki Avatar asked Jun 30 '19 20:06

yaki


2 Answers

With adist(), you can retrieve the operations:

drop(attr(adist("abc abc","abcde acc", count = TRUE), "counts"))

ins del sub 
  2   0   1 

From ?adist:

If counts is TRUE, the transformation counts are returned as the "counts" attribute of this matrix, as a 3-dimensional array with dimensions corresponding to the elements of x, the elements of y, and the type of transformation (insertions, deletions and substitutions), respectively.

like image 130
tmfmnk Avatar answered Oct 23 '22 19:10

tmfmnk


This is known as the Needleman–Wunsch algorithm. It calculates both the distance between two strings as well as the so-called traceback, which allows you to reconstruct the alignment.

Since this problem mostly crops up in biology when comparing biological sequences, this algorithm (and related ones) are implemented in the R package {Biostrings}, which is part of Bioconductor.

Since this package implements are more general solution than the simple Levenshtein distance, the usage is unfortunately more complex, and the usage vignette is correspondingly long. But the fundamental usage for your purposes is as follows:

library(Biostrings)

dist_mat = diag(27L)
colnames(dist_mat) = rownames(dist_mat) = c(letters, ' ')

result = pairwiseAlignment(
    "abc abc", "abcde acc",
    substitutionMatrix = dist_mat,
    gapOpening = 1, gapExtension = 1
)

This won’t simply give you the list c('b', 'c', 'c'), though, because that list does not fully represent what actually happened here. Instead, it will return an alignment between the two strings. This can be represented as a sequence with substitutions and gaps:

score(result)
# [1] 3
aligned(result)
as.matrix(aligned(result))
#      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
# [1,] "a"  "b"  "c"  "-"  "-"  " "  "a"  "b"  "c"
aligned(result)

— For each character in the second string it provides the corresponding character in the original string, replacing inserted characters by -. Basically, this is a “recipe” for transforming the first string into the second string. Note that it will only contain insertions and substitutions, not deletions. To get these, you need to perform the alignment the other way round (i.e. swapping the string arguments).

like image 35
Konrad Rudolph Avatar answered Oct 23 '22 20:10

Konrad Rudolph