I'm working on a very rough, first-draft algorithm to determine how similar 2 Strings are. I'm also using Levenshtein Distance to calculate the edit distance between the Strings.
What I'm doing currently is basically taking the total number of edits and dividing it by the size of the larger String. If that value is below some threshold, currently randomly set to 25%, then they are "similar enough".
However, this is totally arbitrary and I don't think is a very good way to calculate similarity. Is there some kind of math equation or probability/statistics approach to taking the Levenshtein Distance data and using it to say "yes, these strings are similar enough based on the number of edits made and the size of the strings"?
Also, the key thing here is that I'm using an arbitrary threshold and I would prefer not to do that. How can I compute this threshold instead of assign it so that I can safely say that 2 Strings are "similar enough"?
I'm comparing strings that represent a Java stack trace. The reason I want to do this is to group a bunch of given stack traces by similarity and use it as a filter to sort "stuff" :) This grouping is important for a higher level reason which I can't exactly share publicly.
So far, my algorithm (pseudo code) is roughly along the lines of:
/* * The input lists represent the Strings I want to test for similarity. The * Strings are split apart based on new lines / carriage returns because Java * stack traces are not a giant one-line String, rather a multi-line String. * So each element in the input lists is a "line" from its stack trace. */ calculate similarity (List<String> list1, List<String> list2) { length1 = 0; length2 = 0; levenshteinDistance = 0; iterator1 = list1.iterator(); iterator2 = list2.iterator(); while ( iterator1.hasNext() && iterator2.hasNext() ) { // skip blank/empty lines because they are not interesting str1 = iterator1.next(); length1 += str1.length(); str2 = iterator2.next(); length2 += str2.length(); levensteinDistance += getLevenshteinDistance(str1, str2); } // handle the rest of the lines from the iterator that has not terminated difference = levenshteinDistance / Math.max(length1, length2); return (difference < 0.25) ? true : false; // <- arbitrary threshold, yuck! }
Using String. equals() :In Java, string equals() method compares the two given strings based on the data/content of the string. If all the contents of both the strings are same then it returns true. If any character does not match, then it returns false.
The way to check the similarity between any data point or groups is by calculating the distance between those data points. In textual data as well, we check the similarity between the strings by calculating the distance between one text to another text.
The Levenshtein distance is a string metric for measuring difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other.
How about using cosine similarity? This is a general technique to assess similarity between two texts. It works as follows:
Take all the letters from both Strings an build a table like this:
Letter | String1 | String2
This can be a simple hash table or whatever.
In the letter column put each letter and in the string columns put their frequency inside that string (if a letter does not appear in a string the value is 0).
It is called cosine similarity because you interpret each of the two string columns as vectors, where each component is the number associated to a letter. Next, compute the cosine of the "angle" between the vectors as:
C = (V1 * V2) / (|V1| * |V2|)
The numerator is the dot product, that is the sum of the products of the corresponding components, and the denominator is the product of the sizes of the vectors.
How close C is to 1 gives you how similar the Strings are.
It may seem complicated but it's just a few lines of code once you understand the idea.
Let's see an example: consider the strings
s1 = aabccdd s2 = ababcd
The table looks like:
Letter a b c d s1 2 1 2 2 s2 2 2 1 1
And thus:
C = (V1 * V2) / (|V1| * |V2|) = (2 * 2 + 1 * 2 + 2 * 1 + 2 * 1) / (sqrt(13) * sqrt(10)) = 0.877
So they are "pretty" similar.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With