Similarity String Comparison in Java

Computing the `editDistance()`:

The `editDistance()` function is expected to calculate the edit distance between the two strings. There are several implementations of this step, and each may suit a specific scenario better. The most common is the Levenshtein distance algorithm, which we'll use in the example below (for very large strings, other algorithms are likely to perform better).

Here are two options to calculate the edit distance:

You can use Apache Commons Text's implementation of Levenshtein distance: `LevenshteinDistance.apply(CharSequence left, CharSequence right)`.
Implement it on your own. Below you'll find an example implementation.

Working example:


public class StringSimilarity {

  /**
   * Calculates the similarity between two strings as a number from 0 to 1,
   * where 1 means the strings are equal (ignoring case).
   */
  public static double similarity(String s1, String s2) {
    String longer = s1, shorter = s2;
    if (s1.length() < s2.length()) { // longer should always have greater length
      longer = s2; shorter = s1;
    }
    int longerLength = longer.length();
    if (longerLength == 0) { return 1.0; /* both strings are zero length */ }
    /* // If you have Apache Commons Text, you can use it to calculate the edit distance:
    LevenshteinDistance levenshteinDistance = new LevenshteinDistance();
    return (longerLength - levenshteinDistance.apply(longer, shorter)) / (double) longerLength; */
    return (longerLength - editDistance(longer, shorter)) / (double) longerLength;

  }

  // Example implementation of the Levenshtein edit distance (case-insensitive:
  // both inputs are lower-cased before comparison).
  // See http://rosettacode.org/wiki/Levenshtein_distance#Java
  public static int editDistance(String s1, String s2) {
    s1 = s1.toLowerCase();
    s2 = s2.toLowerCase();

    int[] costs = new int[s2.length() + 1];
    for (int i = 0; i <= s1.length(); i++) {
      int lastValue = i;
      for (int j = 0; j <= s2.length(); j++) {
        if (i == 0)
          costs[j] = j;
        else {
          if (j > 0) {
            int newValue = costs[j - 1];
            if (s1.charAt(i - 1) != s2.charAt(j - 1))
              newValue = Math.min(Math.min(newValue, lastValue),
                  costs[j]) + 1;
            costs[j - 1] = lastValue;
            lastValue = newValue;
          }
        }
      }
      if (i > 0)
        costs[s2.length()] = lastValue;
    }
    return costs[s2.length()];
  }

  public static void printSimilarity(String s, String t) {
    System.out.println(String.format(
      "%.3f is the similarity between \"%s\" and \"%s\"", similarity(s, t), s, t));
  }

  public static void main(String[] args) {
    printSimilarity("", "");
    printSimilarity("1234567890", "1");
    printSimilarity("1234567890", "123");
    printSimilarity("1234567890", "1234567");
    printSimilarity("1234567890", "1234567890");
    printSimilarity("1234567890", "1234567980");
    printSimilarity("47/2010", "472010");
    printSimilarity("47/2010", "472011");
    printSimilarity("47/2010", "AB.CDEF");
    printSimilarity("47/2010", "4B.CDEFG");
    printSimilarity("47/2010", "AB.CDEFG");
    printSimilarity("The quick fox jumped", "The fox jumped");
    printSimilarity("The quick fox jumped", "The fox");
    printSimilarity("kitten", "sitting");
  }

}

Output:

1.000 is the similarity between "" and ""
0.100 is the similarity between "1234567890" and "1"
0.300 is the similarity between "1234567890" and "123"
0.700 is the similarity between "1234567890" and "1234567"
1.000 is the similarity between "1234567890" and "1234567890"
0.800 is the similarity between "1234567890" and "1234567980"
0.857 is the similarity between "47/2010" and "472010"
0.714 is the similarity between "47/2010" and "472011"
0.000 is the similarity between "47/2010" and "AB.CDEF"
0.125 is the similarity between "47/2010" and "4B.CDEFG"
0.000 is the similarity between "47/2010" and "AB.CDEFG"
0.700 is the similarity between "The quick fox jumped" and "The fox jumped"
0.350 is the similarity between "The quick fox jumped" and "The fox"
0.571 is the similarity between "kitten" and "sitting"

Yes, there are many well-documented algorithms, such as:

Cosine similarity
Jaccard similarity
Dice's coefficient
Matching similarity
Overlap similarity
etc etc
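To give a feel for how a set-based measure differs from an edit distance, here is a minimal sketch of Jaccard similarity. It assumes the strings are compared as sets of character bigrams and lower-cased first (to mirror the case-insensitive `editDistance()` above); both choices are assumptions made for this example, and the names `JaccardSketch`, `bigrams` and `jaccard` are made up for illustration.

```java
import java.util.HashSet;
import java.util.Set;

class JaccardSketch {

  // Collects the distinct character bigrams of s after lower-casing it.
  static Set<String> bigrams(String s) {
    String t = s.toLowerCase();
    Set<String> grams = new HashSet<>();
    for (int i = 0; i + 2 <= t.length(); i++) {
      grams.add(t.substring(i, i + 2));
    }
    return grams;
  }

  // Jaccard index: size of the intersection divided by size of the union.
  static double jaccard(String s1, String s2) {
    Set<String> a = bigrams(s1);
    Set<String> b = bigrams(s2);
    if (a.isEmpty() && b.isEmpty()) {
      // Assumption: strings too short to have bigrams are compared directly.
      return s1.equalsIgnoreCase(s2) ? 1.0 : 0.0;
    }
    Set<String> intersection = new HashSet<>(a);
    intersection.retainAll(b);
    Set<String> union = new HashSet<>(a);
    union.addAll(b);
    return intersection.size() / (double) union.size();
  }

  public static void main(String[] args) {
    // "47/2010" and "472010" share 4 bigrams out of 7 distinct ones.
    System.out.println(jaccard("47/2010", "472010"));
  }
}
```

Unlike an edit distance, this ignores where in the string the bigrams occur, so it tends to be more forgiving of reordered parts and less sensitive to single-character changes in long strings.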

A good summary ("Sam's String Metrics") can be found here (the original link is dead, so it points to the Internet Archive).

Also check these projects:

Simmetrics
jtmt

I translated the Levenshtein distance algorithm into JavaScript:

String.prototype.LevenshteinDistance = function (s2) {
    var array = new Array(this.length + 1);
    for (var i = 0; i < this.length + 1; i++)
        array[i] = new Array(s2.length + 1);

    for (var i = 0; i < this.length + 1; i++)
        array[i][0] = i;
    for (var j = 0; j < s2.length + 1; j++)
        array[0][j] = j;

    for (var i = 1; i < this.length + 1; i++) {
        for (var j = 1; j < s2.length + 1; j++) {
            if (this[i - 1] == s2[j - 1]) array[i][j] = array[i - 1][j - 1];
            else {
                array[i][j] = Math.min(array[i][j - 1] + 1, array[i - 1][j] + 1);
                array[i][j] = Math.min(array[i][j], array[i - 1][j - 1] + 1);
            }
        }
    }
    return array[this.length][s2.length];
};

There are indeed a lot of string similarity measures out there:

Levenshtein edit distance;
Damerau-Levenshtein distance;
Jaro-Winkler similarity;
Longest Common Subsequence edit distance;
Q-Gram (Ukkonen);
n-Gram distance (Kondrak);
Jaccard index;
Sorensen-Dice coefficient;
Cosine similarity;
...

You can find explanations and Java implementations of these here: https://github.com/tdebatty/java-string-similarity
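Several of these measures differ mainly in which edits they count. As a rough illustration, the sketch below implements the optimal string alignment variant of Damerau-Levenshtein, which adds swaps of adjacent characters to the insert, delete and substitute operations of plain Levenshtein. The class and method names are made up for this example and are not taken from java-string-similarity.

```java
class TranspositionDistanceSketch {

  // Optimal string alignment distance: Levenshtein edits plus swaps of two
  // adjacent characters, each counted as a single edit.
  static int osaDistance(String s1, String s2) {
    int n = s1.length();
    int m = s2.length();
    int[][] d = new int[n + 1][m + 1];
    for (int i = 0; i <= n; i++) d[i][0] = i; // delete all of s1's prefix
    for (int j = 0; j <= m; j++) d[0][j] = j; // insert all of s2's prefix
    for (int i = 1; i <= n; i++) {
      for (int j = 1; j <= m; j++) {
        int cost = s1.charAt(i - 1) == s2.charAt(j - 1) ? 0 : 1;
        d[i][j] = Math.min(
            Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1), // deletion, insertion
            d[i - 1][j - 1] + cost);                    // match or substitution
        if (i > 1 && j > 1
            && s1.charAt(i - 1) == s2.charAt(j - 2)
            && s1.charAt(i - 2) == s2.charAt(j - 1)) {
          d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1); // adjacent swap
        }
      }
    }
    return d[n][m];
  }

  public static void main(String[] args) {
    System.out.println(osaDistance("1234567890", "1234567980"));
  }
}
```

For the pair "1234567890" and "1234567980", plain Levenshtein needs two substitutions, whereas this variant counts the swapped "89" as one edit, so typos that transpose neighbouring characters are penalised less.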

You could use Levenshtein distance to calculate the difference between two strings. http://en.wikipedia.org/wiki/Levenshtein_distance
