I need to calculate the similarity between 2 strings. So what exactly do I mean? Let me explain with an example:
hospital
haspita
Now my aim is to determine how many characters I need to modify the mistaken word to obtain the real word. In this example, I need to modify 2 letters. So what would be the percent? I take the length of the real word always. So it becomes 2 / 8 = 25% so these 2 given string DSM is 75%.
How can I achieve this with performance being a key consideration?
Hamming Distance, named after the American mathematician, is the simplest algorithm for calculating string similarity. It checks the similarity by comparing the changes in the number of positions between the two strings.
To compute the Hamming distance between two strings, you compare the characters of each position in the string. The number of unequal characters is the Hamming distance. An advantage of the Hamming Distance is that it is very fast and simple to do this position-wise comparison.
Levenshtein distance. A metric for measuring similarity between two strings. It is equal to the minimum number of operations required to transform a given string into another one.
To calculate the similarity between two examples, you need to combine all the feature data for those two examples into a single numeric value. For instance, consider a shoe data set with only one feature: shoe size. You can quantify how similar two shoes are by calculating the difference between their sizes.
I just addressed this exact same issue a few weeks ago. Since someone is asking now, I'll share the code. In my exhaustive tests my code is about 10x faster than the C# example on Wikipedia even when no maximum distance is supplied. When a maximum distance is supplied, this performance gain increases to 30x - 100x +. Note a couple key points for performance:
ints
compare much faster than chars
. Code (it works the exact same if you replace int[]
with String
in the parameter declarations:
/// <summary> /// Computes the Damerau-Levenshtein Distance between two strings, represented as arrays of /// integers, where each integer represents the code point of a character in the source string. /// Includes an optional threshhold which can be used to indicate the maximum allowable distance. /// </summary> /// <param name="source">An array of the code points of the first string</param> /// <param name="target">An array of the code points of the second string</param> /// <param name="threshold">Maximum allowable distance</param> /// <returns>Int.MaxValue if threshhold exceeded; otherwise the Damerau-Leveshteim distance between the strings</returns> public static int DamerauLevenshteinDistance(int[] source, int[] target, int threshold) { int length1 = source.Length; int length2 = target.Length; // Return trivial case - difference in string lengths exceeds threshhold if (Math.Abs(length1 - length2) > threshold) { return int.MaxValue; } // Ensure arrays [i] / length1 use shorter length if (length1 > length2) { Swap(ref target, ref source); Swap(ref length1, ref length2); } int maxi = length1; int maxj = length2; int[] dCurrent = new int[maxi + 1]; int[] dMinus1 = new int[maxi + 1]; int[] dMinus2 = new int[maxi + 1]; int[] dSwap; for (int i = 0; i <= maxi; i++) { dCurrent[i] = i; } int jm1 = 0, im1 = 0, im2 = -1; for (int j = 1; j <= maxj; j++) { // Rotate dSwap = dMinus2; dMinus2 = dMinus1; dMinus1 = dCurrent; dCurrent = dSwap; // Initialize int minDistance = int.MaxValue; dCurrent[0] = j; im1 = 0; im2 = -1; for (int i = 1; i <= maxi; i++) { int cost = source[im1] == target[jm1] ? 0 : 1; int del = dCurrent[im1] + 1; int ins = dMinus1[i] + 1; int sub = dMinus1[im1] + cost; //Fastest execution for min value of 3 integers int min = (del > ins) ? (ins > sub ? sub : ins) : (del > sub ? sub : del); if (i > 1 && j > 1 && source[im2] == target[jm1] && source[im1] == target[j - 2]) min = Math.Min(min, dMinus2[im2] + cost); dCurrent[i] = min; if (min < minDistance) { minDistance = min; } im1++; im2++; } jm1++; if (minDistance > threshold) { return int.MaxValue; } } int result = dCurrent[maxi]; return (result > threshold) ? int.MaxValue : result; }
Where Swap
is:
static void Swap<T>(ref T arg1,ref T arg2) { T temp = arg1; arg1 = arg2; arg2 = temp; }
What you are looking for is called edit distance or Levenshtein distance. The wikipedia article explains how it is calculated, and has a nice piece of pseudocode at the bottom to help you code this algorithm in C# very easily.
Here's an implementation from the first site linked below:
private static int CalcLevenshteinDistance(string a, string b) { if (String.IsNullOrEmpty(a) && String.IsNullOrEmpty(b)) { return 0; } if (String.IsNullOrEmpty(a)) { return b.Length; } if (String.IsNullOrEmpty(b)) { return a.Length; } int lengthA = a.Length; int lengthB = b.Length; var distances = new int[lengthA + 1, lengthB + 1]; for (int i = 0; i <= lengthA; distances[i, 0] = i++); for (int j = 0; j <= lengthB; distances[0, j] = j++); for (int i = 1; i <= lengthA; i++) for (int j = 1; j <= lengthB; j++) { int cost = b[j - 1] == a[i - 1] ? 0 : 1; distances[i, j] = Math.Min ( Math.Min(distances[i - 1, j] + 1, distances[i, j - 1] + 1), distances[i - 1, j - 1] + cost ); } return distances[lengthA, lengthB]; }
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With