
Difference in normalization of Levenshtein (edit) distance?

If the Levenshtein distance between two strings, s and t is given by L(s,t),

what is the difference in the impact on the resulting heuristic of the following three normalization approaches?

  1. L(s,t) / [length(s) + length(t)]

  2. L(s,t) / max[length(s), length(t)]

  3. (L(s,t)*2) / [length(s) + length(t)]

I noticed that normalization approach 2 is recommended by the Levenshtein distance Wikipedia page but no mention is made of approach 1. Are both approaches equally valid? Just wondering if there is some mathematical justification for using one over the other.

Also, what is the difference between approach 1 and approach 3?

With the following example:

s = "Hi, my name is"
t = "Hello, my name is"
L(s,t) = 4
length(s) = 14 # (includes white space)
length(t) = 17 # (includes white space)

The normalized Levenshtein distances under the three approaches above are:

[Approach 1]   4  /(14+17) = 0.129
[Approach 2]   4  /(17)    = 0.235
[Approach 3] (4*2)/(14+17) = 0.258
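
The arithmetic above can be checked with a short Python sketch. The helper names `levenshtein` and `normalized` are made up for illustration and do not come from any library; the distance uses the usual unit-cost dynamic-programming recurrence, and returning zeros for two empty strings is an assumption of this sketch.

```python
def levenshtein(s: str, t: str) -> int:
    """Unit-cost Levenshtein distance, computed row by row."""
    prev = list(range(len(t) + 1))  # distances from "" to each prefix of t
    for i, cs in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized(s: str, t: str):
    """Return (approach 1, approach 2, approach 3) for s and t."""
    total = len(s) + len(t)
    if total == 0:
        return 0.0, 0.0, 0.0  # assumption: two empty strings count as identical
    d = levenshtein(s, t)
    return d / total, d / max(len(s), len(t)), 2 * d / total


s = "Hi, my name is"
t = "Hello, my name is"
for label, value in zip(("Approach 1", "Approach 2", "Approach 3"), normalized(s, t)):
    print(f"[{label}] {value:.3f}")
```

Run on the example strings, this should reproduce the three values listed above.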
user2205916 asked Dec 09 '16 18:12



1 Answer

The effects of both variants should be nearly the same.

  • The second approach covers a range from 0 (strings are equal) to 1 (completely different),
  • while the upper bound of the first variant depends on the lengths of the strings: since L(s,t) is at most max[length(s), length(t)], the bound is 0.5 when the lengths are equal and approaches 1 as the difference between the lengths grows.
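
To make the ranges concrete, here is a minimal Python sketch that computes the largest value each approach can take for given string lengths. It relies only on the fact that L(s,t) never exceeds the longer length; `upper_bounds` is a hypothetical helper name, and it assumes at least one string is non-empty. Note that approach 3 is simply approach 1 multiplied by 2, so its bound is doubled as well.

```python
def upper_bounds(len_s: int, len_t: int):
    """Largest possible value of approaches 1, 2 and 3 for the given lengths.

    Assumes max(len_s, len_t) > 0.
    """
    worst = max(len_s, len_t)  # the Levenshtein distance cannot exceed this
    total = len_s + len_t
    return worst / total, worst / worst, 2 * worst / total


# Equal lengths: approach 1 tops out at 0.5, approaches 2 and 3 at 1.0.
print(upper_bounds(10, 10))

# Very different lengths: approach 1 approaches 1, approach 3 approaches 2.
print(upper_bounds(1, 99))
```

So approach 2 is the only one of the three whose maximum is 1 regardless of the string lengths.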
clemens answered Oct 01 '22 06:10