Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I measure string similarity between sentences?

I have the following task.

Given is a list of strings like so:

        var strings = [
            'Steve jobs created the iPod when he was at Apple',
            'I really like the new Macbook by Apple',
            'Jony Ive was concerned being fired by Steve Jobs after his return to Apple',
            'The new Macbook has just one USB-C type connector',
            'I like bananas',
            'The brezels I can buy in my local store are much better than the ones in the supermarket',
            'the',
            'foo',
            'Steve'
        ];

I now want to compare each string with each other and for each comparison I want to find out how similar they are to each other on a scale from 0-1 (or 0%-100%).

So, I googled around a little and found this: Similarity String Comparison in Java

So, I followed the instruction there, and ported the method similarity(String s1, String s2) to JavaScript:

        function similarity(s1, s2) {
            var longer = s1;
            var shorter = s2;
            if (s1.length < s2.length) {
                longer = s2;
                shorter = s1;
            }
            var longerLength = longer.length;
            if (longerLength == 0) {
                return 1.0;
            }
            return (longerLength - longer.LevenshteinDistance(shorter)) / longerLength;
        }

As comparison algorithm I used Levenshtein:

        String.prototype.LevenshteinDistance = function (s2) {
            var array = new Array(this.length + 1);
            for (var i = 0; i < this.length + 1; i++)
                array[i] = new Array(s2.length + 1);

            for (var i = 0; i < this.length + 1; i++)
                array[i][0] = i;
            for (var j = 0; j < s2.length + 1; j++)
                array[0][j] = j;

            for (var i = 1; i < this.length + 1; i++) {
                for (var j = 1; j < s2.length + 1; j++) {
                    if (this[i - 1] == s2[j - 1]) array[i][j] = array[i - 1][j - 1];
                    else {
                        array[i][j] = Math.min(array[i][j - 1] + 1, array[i - 1][j] + 1);
                        array[i][j] = Math.min(array[i][j], array[i - 1][j - 1] + 1);
                    }
                }
            }
            return array[this.length][s2.length];
        };

So, as a test I ran a full loop comparing each string with each other and printing the result like this:

            for (var i in strings){
                var s = strings[i];
                print('Checking string: "' + s + '"');
                for (var j in strings){
                    print('-----');
                    var s2 = strings[j];
                    print('vs "' + s2 + '"');
                    var sim = similarity(s, s2);
                    print('Similarity: ' + Math.round(sim*100) + '%');
                }
                print('<br>////// NEXT /////////////////////////////////////////////////<br>');
            }

Ok, now this is the result: https://jsfiddle.net/wxksfa4w/

Now, looking at the results I get some good matches but also some that are completely unrelated to each other, for example:

"Steve jobs created the iPod when he was at Apple" and "I like bananas" match for 13%?

"Steve jobs created the iPod when he was at Apple" and just "Steve" match for just 10% although the exact same word "Steve" is used as in the first sentence?

How can I get better semantic results? Is Levenshtein the wrong algorithm? From what I understood is that Levenshtein counts the numbers of steps of how to change sentence 1 to sentence 2. So, the length of a string seems to have a major impact on the result even if there is semantic similarity.

Any advice?

like image 929
Timo Ernst Avatar asked Mar 29 '15 02:03

Timo Ernst


1 Answers

You should probably be using the presence of words in both sentences as a high hint of similarity. A simple way of doing it is using each sentence as a bag of words and using tf-idf

like image 128
kar288 Avatar answered Oct 05 '22 12:10

kar288