Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Calculating a relative Levenshtein distance - make sense?

I am using both Daitch-Mokotoff soundexing and Damerau-Levenshtein to find out if a user entry and a value in the application are "the same".

Is Levenshtein distance supposed to be used as an absolute value? If I have a 20 letter word, a distance of 4 is not so bad. If the word has 4 letters...

What I am now doing is taking the distance / length to get a distance that better reflects what percentage of the word has been changed.

Is that a valid/proven approach? Or is it plain stupid?

like image 775
Joseph Tura Avatar asked Oct 06 '10 19:10

Joseph Tura


2 Answers

Is Levenshtein distance supposed to be used as an absolute value?

It seems like it would depend on your requirements. (To clarify: Levenshtein distance is an absolute value, but as the OP pointed out, the raw value may not be as useful as for a given application as a measure that takes the length of the word into account. This is because we are really more interested in similarity than distance per se.)

I am using both Daitch-Mokotoff soundexing and Damerau-Levenshtein to find out if a user entry and a value in the application are "the same".

Sounds like you're trying to determine whether the user intended their entry to be the same as a given data value?

Are you doing spell-checking? or conforming invalid input to a known set of values? What are your priorities?

  • Minimize false positives (try to make sure all suggested words are very "similar", and list of suggestions is short)
  • Minimize false negatives (try to make sure that the string the user intended is in the list of suggestions, even if it makes the list long)
  • Maximize average matching accuracy

You might end up using the Levenshtein distance in one way to determine whether a word should be offered in a suggestion list; and another way to determine how to order the suggestion list.

It seems to me, if I've inferred your purpose correctly, that the core thing you want to measure is similarity rather than difference between two strings. As such, you could use Jaro or Jaro-Winkler distance, which takes into account the length of the strings and the number of characters in common:

The Jaro distance dj of two given strings s1 and s2 is

(m / |s1| + m / |s2| + (m - t) / m) / 3

where:

  • m is the number of matching characters
  • t is the number of transpositions

Jaro–Winkler distance uses a prefix scale p which gives more favourable ratings to strings that match from the beginning for a set prefix length l.

like image 136
LarsH Avatar answered Oct 16 '22 20:10

LarsH


The levenshtein distance is a relative value between two words. Comparing the LD to the length is not relevant eg

cat -> scat = 1 (75% similar??)

difference -> differences = 1 (90% similar??)

Both these words have lev distances of 1 ie they differ by one character, but when compared to their lengths the second set would appear to be 'more' similar.

I use soundexing to rank words that have the same lev distance eg

cat and fat both have a LD of 1 relative to kat, but the word is more likely to be kat than fat when using soundex (assuming the word is incrrectly spelt, not incorrectly typed!)

So the short answer is just use the lev distance to determine the similarity.

like image 41
James Westgate Avatar answered Oct 16 '22 21:10

James Westgate