Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to normalise Levenshtein distance for maximum alignment length rather than for string length?

Problem: A few R packages feature Levenshtein distance implementations for computing the similarity of two strings, e.g. http://finzi.psych.upenn.edu/R/library/RecordLinkage/html/strcmp.html. The distances computed can easily be normalised for string length, e.g. by dividing the Levenshtein distance by the length of the longest string involved or by dividing it by the mean of the lengths of the two strings. For some applications in linguistics (e.g. dialectometry and receptive multilingualism research), however, it is recommended that the raw Levenshtein distance be normalised for the length of the longest least-cost alignment (Heeringa, 2004: 130-132). This tends to produce distance measures that make more sense from a perceptual-linguistic point of view.

Example: The German string "tsYklUs" (Zyklus = cycle) can be converted into its Swedish cognate "sYkEl" (cyckel = (bi)cycle) in a 7-slot alignment with two insertions (I) and two substitutions (S) for a total transformation cost of 4. Normalised Levenshtein distance: 4/7

(A)

t--s--Y--k--l--U--s
---s--Y--k--E--l---
===================
I-----------S--S--I = 4

It is also possible to convert the strings in an 8-slot alignment with 3 insertions (I) and 1 deletion (D), also for a total alignment cost of 4. Normalised Levenshtein distance: 4/8

(B)

t--s--Y--k-----l--U--S
---s--Y--k--E--l------
======================
I-----------D-----I--I = 4

The latter alignment makes more sense linguistically, because it aligns the [l]-phonemes with each other rather than with the [E] and [U] vowels.

Question: Does anyone know of any R function that would allow me to normalise Levenshtein distances for the longest least-cost alignment rather than for string length proper? Thanks for your input!

Reference: W.J. Heeringa (2004), Measuring dialect pronunciation differences using Levenshtein distance. PhD thesis, University of Groningen. http://www.let.rug.nl/~heeringa/dialectology/thesis/

Edit - Solution: I think I figured out a solution. The adist function can return the alignment and seems to default to the longest low-cost alignment. To take up the example above, here's the alignment associated with sykel to tsyklus:

> attr(adist("sykel", "tsyklus", counts = TRUE), "trafos")
     [,1]      
[1,] "IMMMDMII"

To compute length-normalised distances as recommended by Heeringa (2004), we can write a modest function:

normLev.fnc <- function(a, b) {
  drop(adist(a, b) / nchar(attr(adist(a, b, counts = TRUE), "trafos")))
}

For the example above, this returns

> normLev.fnc("sykel", "tsyklus")
[1] 0.5

This function also returns the correct normalised distances for Heeringa's (2004: 131) examples:

> normLev.fnc("bine", "bEi")
[1] 0.6
> normLev.fnc("kaninçen", "konEin")
[1] 0.5555556
> normLev.fnc("kenEeri", "kenArje")
[1] 0.5

To compare several pairs of strings:

> L1 <- c("bine", "kaninçen", "kenEeri")
> L2 <- c("bEi",  "konEin", "kenArje")
> diag(normLev.fnc(L1, L2))
[1] 0.6000000 0.5555556 0.5000000
like image 720
jvh_ch Avatar asked Apr 13 '12 12:04

jvh_ch


People also ask

How do you calculate normalized Levenshtein distance?

If you want the result to be in the range [0, 1] , you need to divide the distance by the maximum possible distance between two strings of given lengths. That is, length(str1)+length(str2) for the LCS distance and max(length(str1), length(str2)) for the Levenshtein distance.

What will be the Levenshtein distance between two strings room and Roomroom?

The Levenshtein distance between two strings is the number of single character deletions, insertions, or substitutions required to transform one string into the other. This is also known as the edit distance.

What is the difference between edit distance and Levenshtein distance?

Different definitions of an edit distance use different sets of string operations. Levenshtein distance operations are the removal, insertion, or substitution of a character in the string. Being the most common metric, the term Levenshtein distance is often used interchangeably with edit distance.

What is the difference between Hamming distance and Levenshtein distance?

The Hamming distance is the number of positions at which the corresponding symbols in the two strings are different. The Levenshtein distance between two strings is no greater than the sum of their Levenshtein distances from a third string (triangle inequality).

What are three major operations of Levenshtein edit distance?

Most commonly, the edit operations allowed for this purpose are: (i) insert a character into a string; (ii) delete a character from a string and (iii) replace a character of a string by another character; for these operations, edit distance is sometimes known as Levenshtein distance .


1 Answers

In case any linguists stumble upon this post, I'd like to point out that the algorithms provided by the RecordLinkage package are not necessarily optimal for comparing non-ASCII strings, e.g.:

> levenshteinSim("väg", "way")
[1] -0.3333333
> levenshteinDist("väg", "way")
[1] 4
> levenshteinDist("väg", "wäy")
[1] 2
> levenshteinDist("väg", "wüy")
[1] 3
like image 144
jvh_ch Avatar answered Oct 03 '22 16:10

jvh_ch