Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to calculate distance similarity measure of given 2 strings?

I need to calculate the similarity between 2 strings. So what exactly do I mean? Let me explain with an example:

  • The real word: hospital
  • Mistaken word: haspita

Now my aim is to determine how many characters I need to modify the mistaken word to obtain the real word. In this example, I need to modify 2 letters. So what would be the percent? I take the length of the real word always. So it becomes 2 / 8 = 25% so these 2 given string DSM is 75%.

How can I achieve this with performance being a key consideration?

like image 573
MonsterMMORPG Avatar asked Feb 26 '12 14:02

MonsterMMORPG


People also ask

How do you find the similarity between two strings?

Hamming Distance, named after the American mathematician, is the simplest algorithm for calculating string similarity. It checks the similarity by comparing the changes in the number of positions between the two strings.

How do you find the distance between two strings?

To compute the Hamming distance between two strings, you compare the characters of each position in the string. The number of unequal characters is the Hamming distance. An advantage of the Hamming Distance is that it is very fast and simple to do this position-wise comparison.

Which distance measure is commonly used to estimate the similarity between two strings?

Levenshtein distance. A metric for measuring similarity between two strings. It is equal to the minimum number of operations required to transform a given string into another one.

How is similarity calculated?

To calculate the similarity between two examples, you need to combine all the feature data for those two examples into a single numeric value. For instance, consider a shoe data set with only one feature: shoe size. You can quantify how similar two shoes are by calculating the difference between their sizes.


2 Answers

I just addressed this exact same issue a few weeks ago. Since someone is asking now, I'll share the code. In my exhaustive tests my code is about 10x faster than the C# example on Wikipedia even when no maximum distance is supplied. When a maximum distance is supplied, this performance gain increases to 30x - 100x +. Note a couple key points for performance:

  • If you need to compare the same words over and over, first convert the words to arrays of integers. The Damerau-Levenshtein algorithm includes many >, <, == comparisons, and ints compare much faster than chars.
  • It includes a short-circuiting mechanism to quit if the distance exceeds a provided maximum
  • Use a rotating set of three arrays rather than a massive matrix as in all the implementations I've see elsewhere
  • Make sure your arrays slice accross the shorter word width.

Code (it works the exact same if you replace int[] with String in the parameter declarations:

/// <summary> /// Computes the Damerau-Levenshtein Distance between two strings, represented as arrays of /// integers, where each integer represents the code point of a character in the source string. /// Includes an optional threshhold which can be used to indicate the maximum allowable distance. /// </summary> /// <param name="source">An array of the code points of the first string</param> /// <param name="target">An array of the code points of the second string</param> /// <param name="threshold">Maximum allowable distance</param> /// <returns>Int.MaxValue if threshhold exceeded; otherwise the Damerau-Leveshteim distance between the strings</returns> public static int DamerauLevenshteinDistance(int[] source, int[] target, int threshold) {      int length1 = source.Length;     int length2 = target.Length;      // Return trivial case - difference in string lengths exceeds threshhold     if (Math.Abs(length1 - length2) > threshold) { return int.MaxValue; }      // Ensure arrays [i] / length1 use shorter length      if (length1 > length2) {         Swap(ref target, ref source);         Swap(ref length1, ref length2);     }      int maxi = length1;     int maxj = length2;      int[] dCurrent = new int[maxi + 1];     int[] dMinus1 = new int[maxi + 1];     int[] dMinus2 = new int[maxi + 1];     int[] dSwap;      for (int i = 0; i <= maxi; i++) { dCurrent[i] = i; }      int jm1 = 0, im1 = 0, im2 = -1;      for (int j = 1; j <= maxj; j++) {          // Rotate         dSwap = dMinus2;         dMinus2 = dMinus1;         dMinus1 = dCurrent;         dCurrent = dSwap;          // Initialize         int minDistance = int.MaxValue;         dCurrent[0] = j;         im1 = 0;         im2 = -1;          for (int i = 1; i <= maxi; i++) {              int cost = source[im1] == target[jm1] ? 0 : 1;              int del = dCurrent[im1] + 1;             int ins = dMinus1[i] + 1;             int sub = dMinus1[im1] + cost;              //Fastest execution for min value of 3 integers             int min = (del > ins) ? (ins > sub ? sub : ins) : (del > sub ? sub : del);              if (i > 1 && j > 1 && source[im2] == target[jm1] && source[im1] == target[j - 2])                 min = Math.Min(min, dMinus2[im2] + cost);              dCurrent[i] = min;             if (min < minDistance) { minDistance = min; }             im1++;             im2++;         }         jm1++;         if (minDistance > threshold) { return int.MaxValue; }     }      int result = dCurrent[maxi];     return (result > threshold) ? int.MaxValue : result; } 

Where Swap is:

static void Swap<T>(ref T arg1,ref T arg2) {     T temp = arg1;     arg1 = arg2;     arg2 = temp; } 
like image 85
Joshua Honig Avatar answered Sep 22 '22 22:09

Joshua Honig


What you are looking for is called edit distance or Levenshtein distance. The wikipedia article explains how it is calculated, and has a nice piece of pseudocode at the bottom to help you code this algorithm in C# very easily.

Here's an implementation from the first site linked below:

private static int  CalcLevenshteinDistance(string a, string b)     {     if (String.IsNullOrEmpty(a) && String.IsNullOrEmpty(b)) {         return 0;     }     if (String.IsNullOrEmpty(a)) {         return b.Length;     }     if (String.IsNullOrEmpty(b)) {         return a.Length;     }     int  lengthA   = a.Length;     int  lengthB   = b.Length;     var  distances = new int[lengthA + 1, lengthB + 1];     for (int i = 0;  i <= lengthA;  distances[i, 0] = i++);     for (int j = 0;  j <= lengthB;  distances[0, j] = j++);      for (int i = 1;  i <= lengthA;  i++)         for (int j = 1;  j <= lengthB;  j++)             {             int  cost = b[j - 1] == a[i - 1] ? 0 : 1;             distances[i, j] = Math.Min                 (                 Math.Min(distances[i - 1, j] + 1, distances[i, j - 1] + 1),                 distances[i - 1, j - 1] + cost                 );             }     return distances[lengthA, lengthB];     } 
like image 30
Sergey Kalinichenko Avatar answered Sep 20 '22 22:09

Sergey Kalinichenko