what is a good metric for deciding if 2 Strings are "similar enough"

Tags:

I'm working on a very rough, first-draft algorithm to determine how similar 2 Strings are. I'm also using Levenshtein Distance to calculate the edit distance between the Strings.

What I'm doing currently is basically taking the total number of edits and dividing it by the size of the larger String. If that value is below some threshold, currently randomly set to 25%, then they are "similar enough".

However, this is totally arbitrary and I don't think is a very good way to calculate similarity. Is there some kind of math equation or probability/statistics approach to taking the Levenshtein Distance data and using it to say "yes, these strings are similar enough based on the number of edits made and the size of the strings"?

Also, the key thing here is that I'm using an arbitrary threshold and I would prefer not to do that. How can I compute this threshold instead of assign it so that I can safely say that 2 Strings are "similar enough"?

UPDATE

I'm comparing strings that represent a Java stack trace. The reason I want to do this is to group a bunch of given stack traces by similarity and use it as a filter to sort "stuff" :) This grouping is important for a higher level reason which I can't exactly share publicly.

So far, my algorithm (pseudo code) is roughly along the lines of:

/*  * The input lists represent the Strings I want to test for similarity. The  * Strings are split apart based on new lines / carriage returns because Java  * stack traces are not a giant one-line String, rather a multi-line String.  * So each element in the input lists is a "line" from its stack trace.  */ calculate similarity (List<String> list1, List<String> list2) {      length1 = 0;     length2 = 0;     levenshteinDistance = 0;      iterator1 = list1.iterator();     iterator2 = list2.iterator();      while ( iterator1.hasNext() && iterator2.hasNext() ) {          // skip blank/empty lines because they are not interesting         str1 = iterator1.next();    length1 += str1.length();         str2 = iterator2.next();    length2 += str2.length();          levensteinDistance += getLevenshteinDistance(str1, str2);     }      // handle the rest of the lines from the iterator that has not terminated      difference = levenshteinDistance / Math.max(length1, length2);      return (difference < 0.25) ? true : false; // <- arbitrary threshold, yuck! }

833

asked Dec 09 '11 20:12

Hristo

1 Answers

How about using cosine similarity? This is a general technique to assess similarity between two texts. It works as follows:

Take all the letters from both Strings an build a table like this:

Letter | String1 | String2

This can be a simple hash table or whatever.

In the letter column put each letter and in the string columns put their frequency inside that string (if a letter does not appear in a string the value is 0).

It is called cosine similarity because you interpret each of the two string columns as vectors, where each component is the number associated to a letter. Next, compute the cosine of the "angle" between the vectors as:

C = (V1 * V2) / (|V1| * |V2|)

The numerator is the dot product, that is the sum of the products of the corresponding components, and the denominator is the product of the sizes of the vectors.

How close C is to 1 gives you how similar the Strings are.

It may seem complicated but it's just a few lines of code once you understand the idea.

Let's see an example: consider the strings

s1 = aabccdd s2 = ababcd

The table looks like:

Letter a b c d s1     2 1 2 2 s2     2 2 1 1

And thus:

C = (V1 * V2) / (|V1| * |V2|) =  (2 * 2 + 1 * 2 + 2 * 1 + 2 * 1) / (sqrt(13) * sqrt(10)) = 0.877

So they are "pretty" similar.

answered Sep 27 '22 17:09

Tudor

Related questions
                            
                                Clean way to catch all errors thrown in an MVC 3.0 application?
                            
                                What inspired Ruby's =begin .. =end comment block syntax?
                            
                                How do I load a string into a FileStream without going to disk?
                            
                                Does realloc keep the memory alignment of posix_memalign?
                            
                                Chrome (maybe Safari?) fires "blur" twice on input fields when browser loses focus
                            
                                Is there such a thing as bidirectional maps in Scala?
                            
                                A simple scatterplot example in D3.js?
                            
                                Run Visual Studio on Microsoft Surface?
                            
                                Use client-side MVC/MVVM patterns with MVC server-side pattern
                            
                                Parsing UTF-8/unicode strings with lxml HTML
                            
                                Code formatter / beautifier for bash (in command line)?
                            
                                Set title for contextual action bar

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With