Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Normalizing the edit distance

I have a question that can we normalize the levenshtein edit distance by dividing the e.d value by the length of the two strings? I am asking this because, if we compare two strings of unequal length, the difference between the lengths of the two will be counted as well. for eg: ed('has a', 'has a ball') = 4 and ed('has a', 'has a ball the is round') = 15. if we increase the length of the string, the edit distance will increase even though they are similar. Therefore, I can not set a value, what a good edit distance value should be.

like image 328
Naufal Khalid Avatar asked Aug 20 '17 14:08

Naufal Khalid


People also ask

How do you normalize edit distance?

To quantify the similarity, we normalize the edit distance. One approach is to calculate edit distance as usual and then divide it by the number of operations, usually called the length of the edit path. This is called edit distance with post-normalization.

How do you calculate normalized Levenshtein distance?

If you want the result to be in the range [0, 1] , you need to divide the distance by the maximum possible distance between two strings of given lengths. That is, length(str1)+length(str2) for the LCS distance and max(length(str1), length(str2)) for the Levenshtein distance.

What does minimum edit distance mean and why is it used?

• The minimum edit distance between two strings is defined as the minimum number. of editing operations (insertion, deletion, substitution) needed to transform one string into another.

What is the edit distance between?

In computational linguistics and computer science, edit distance is a string metric, i.e. a way of quantifying how dissimilar two strings (e.g., words) are to one another, that is measured by counting the minimum number of operations required to transform one string into the other.


1 Answers

Yes, normalizing the edit distance is one way to put the differences between strings on a single scale from "identical" to "nothing in common".

A few things to consider:

  1. Whether or not the normalized distance is a better measure of similarity between strings depends on the application. If the question is "how likely is this word to be a misspelling of that word?", normalization is a way to go. If it's "how much has this document changed since the last version?", the raw edit distance may be a better option.
  2. If you want the result to be in the range [0, 1], you need to divide the distance by the maximum possible distance between two strings of given lengths. That is, length(str1)+length(str2) for the LCS distance and max(length(str1), length(str2)) for the Levenshtein distance.
  3. The normalized distance is not a metric, as it violates the triangle inequality.
like image 172
Anton Avatar answered Nov 10 '22 12:11

Anton