How to normalise Levenshtein distance for maximum alignment length rather than for string length?

Tags:

levenshtein-distance

similarity

edit-distance

Problem: A few R packages provide Levenshtein distance implementations for computing the similarity of two strings, e.g. http://finzi.psych.upenn.edu/R/library/RecordLinkage/html/strcmp.html. The resulting distances are easily normalised for string length, e.g. by dividing the Levenshtein distance by the length of the longer string or by the mean length of the two strings. For some applications in linguistics (e.g. dialectometry and receptive multilingualism research), however, it is recommended that the raw Levenshtein distance be normalised for the length of the longest least-cost alignment (Heeringa, 2004: 130-132). This tends to produce distance measures that make more sense from a perceptual-linguistic point of view.
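For comparison, the two string-length normalisations mentioned above can be sketched with base R's adist() (utils package). The variable names are illustrative only, and the two strings are the lower-case transcriptions used further down in this post.

```r
# Two cognates in ad hoc phonetic transcription (illustrative input)
a <- "sykel"
b <- "tsyklus"

# Raw Levenshtein distance; drop() turns the 1 x 1 matrix into a plain number
d <- drop(adist(a, b))

# Normalise by the length of the longer string
by_max <- d / max(nchar(a), nchar(b))

# Normalise by the mean length of the two strings
by_mean <- d / mean(nchar(c(a, b)))

c(raw = d, by_max = by_max, by_mean = by_mean)
```

Neither variant looks at the alignment itself, which is the limitation the rest of this question is about.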

Example: The German string "tsYklUs" (Zyklus = cycle) can be converted into its Swedish cognate "sYkEl" (cykel = (bi)cycle) in a 7-slot alignment with two insertions (I) and two substitutions (S) for a total transformation cost of 4. Normalised Levenshtein distance: 4/7

(A)

t--s--Y--k--l--U--s
---s--Y--k--E--l---
===================
I-----------S--S--I = 4

It is also possible to convert the strings in an 8-slot alignment with 3 insertions (I) and 1 deletion (D), also for a total alignment cost of 4. Normalised Levenshtein distance: 4/8

(B)

t--s--Y--k-----l--U--s
---s--Y--k--E--l------
======================
I-----------D-----I--I = 4

The latter alignment makes more sense linguistically, because it aligns the [l]-phonemes with each other rather than with the [E] and [U] vowels.
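The arithmetic behind the two normalised values can be checked in a few lines of R. The operation strings below are a hand-made encoding of alignments (A) and (B), with M standing for a matched slot; this encoding is an assumption made for illustration, not output from any package.

```r
# Hand-coded operation strings for alignments (A) and (B):
# M = match, S = substitution, I = insertion, D = deletion
alignments <- c(A = "IMMMSSI", B = "IMMMDMII")

# Unit cost: every slot that is not a match costs 1
cost <- vapply(strsplit(alignments, ""),
               function(ops) sum(ops != "M"), integer(1))

# Alignment length = number of slots
slots <- nchar(alignments)

cost / slots   # A: 4/7 (about 0.571), B: 4/8 = 0.5
```

Both alignments have the same cost, so only the number of slots separates them.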

Question: Does anyone know of any R function that would allow me to normalise Levenshtein distances for the longest least-cost alignment rather than for string length proper? Thanks for your input!

Reference: W.J. Heeringa (2004), Measuring dialect pronunciation differences using Levenshtein distance. PhD thesis, University of Groningen. http://www.let.rug.nl/~heeringa/dialectology/thesis/

Edit - Solution: I think I figured out a solution. The adist function (utils package) can return the alignment and seems to default to the longest least-cost alignment. To take up the example above, here's the alignment associated with converting "sykel" into "tsyklus":

> attr(adist("sykel", "tsyklus", counts = TRUE), "trafos")
     [,1]      
[1,] "IMMMDMII"
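The same call also reports how many operations of each type the alignment uses, through the "counts" attribute documented for adist(). A small sketch; the object names d and ops are arbitrary:

```r
d <- adist("sykel", "tsyklus", counts = TRUE)

# Insertions, deletions and substitutions for the first (and only) pair
ops <- attr(d, "counts")[1, 1, ]
ops

# With the default unit costs, the operation counts add up to the distance
sum(ops) == d[1, 1]
```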

To compute length-normalised distances as recommended by Heeringa (2004), we can write a modest function:

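normLev.fnc <- function(a, b) {
  # raw distance divided by the length of the alignment string ("trafos"),
  # e.g. 4 / nchar("IMMMDMII") = 0.5; drop() reduces a 1 x 1 matrix to a scalar
  drop(adist(a, b) / nchar(attr(adist(a, b, counts = TRUE), "trafos")))
}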

For the example above, this returns

> normLev.fnc("sykel", "tsyklus")
[1] 0.5

This function also returns the correct normalised distances for Heeringa's (2004: 131) examples:

> normLev.fnc("bine", "bEi")
[1] 0.6
> normLev.fnc("kaninçen", "konEin")
[1] 0.5555556
> normLev.fnc("kenEeri", "kenArje")
[1] 0.5

To compare several pairs of strings, take the diagonal of the resulting distance matrix, which holds the element-wise pairs:

> L1 <- c("bine", "kaninçen", "kenEeri")
> L2 <- c("bEi",  "konEin", "kenArje")
> diag(normLev.fnc(L1, L2))
[1] 0.6000000 0.5555556 0.5000000
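Taking diag() works, but it first computes the distance for every combination of L1 and L2 entries. For long word lists a pairwise variant might be preferable. The helper below is a sketch under that assumption, and its name normLevPairs.fnc is made up for this example:

```r
# Hypothetical helper: normalised distance for element-wise pairs only
normLevPairs.fnc <- function(a, b) {
  stopifnot(length(a) == length(b))
  mapply(function(x, y) {
    d <- adist(x, y, counts = TRUE)
    # as.vector() strips the attributes; the alignment string gives the length
    as.vector(d) / nchar(attr(d, "trafos")[1, 1])
  }, a, b, USE.NAMES = FALSE)
}

normLevPairs.fnc(c("bine", "kenEeri"), c("bEi", "kenArje"))
```

It should return the same values as the diagonal approach while calling adist() once per pair.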

asked Apr 13 '12 12:04

jvh_ch

1 Answer

In case any linguists stumble upon this post, I'd like to point out that the algorithms provided by the RecordLinkage package are not necessarily optimal for comparing non-ASCII strings, e.g.:

> levenshteinSim("väg", "way")
[1] -0.3333333
> levenshteinDist("väg", "way")
[1] 4
> levenshteinDist("väg", "wäy")
[1] 2
> levenshteinDist("väg", "wüy")
[1] 3
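One possible explanation, offered here as an assumption rather than something checked against the RecordLinkage source, is that these functions compare bytes rather than characters: in UTF-8, "ä" and "ü" occupy two bytes each. The sketch below illustrates the difference with base R's nchar() and adist(), whose useBytes argument switches to byte-wise comparison. The string is written with a \u escape so that the example does not depend on the file encoding.

```r
vag <- enc2utf8("v\u00e4g")   # "väg", forced to UTF-8

nchar(vag, type = "chars")    # 3 characters
nchar(vag, type = "bytes")    # 4 bytes, since "ä" takes two bytes in UTF-8

# Character-wise comparison (the default): three substitutions
by_char <- drop(adist(vag, "way"))

# Byte-wise comparison: three substitutions plus one deletion
by_byte <- drop(adist(vag, "way", useBytes = TRUE))

c(chars = by_char, bytes = by_byte)
```

If the byte-wise reading is right, adist() with its default settings is the safer choice for transcriptions containing non-ASCII symbols.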

answered Oct 03 '22 16:10

jvh_ch
