
Cosine similarity vs Hamming distance [closed]

To compute the similarity between two documents, I create a feature vector containing the term frequencies. But then, for the next step, I can't decide between "Cosine similarity" and "Hamming distance".

My question: Do you have experience with these algorithms? Which one gives you better results?

In addition to that: Could you tell me how to code the Cosine similarity in PHP? For Hamming distance, I've already got the code:

function check ($terms1, $terms2) {
    $counts1 = array_count_values($terms1);
    $totalScore = 0;
    foreach ($terms2 as $term) {
        if (isset($counts1[$term])) $totalScore += $counts1[$term];
    }
    return $totalScore * 500 / (count($terms1) * count($terms2));
}

I don't want to use any other algorithm. I just need help deciding between these two.

And maybe someone can suggest how to improve the algorithms. Will I get better results if I filter out stop words or common words?

I hope you can help me. Thanks in advance!

Asked Jun 03 '09 by caw


2 Answers

A Hamming distance is defined between two strings of equal length, with the order of the symbols taken into account.

Since your documents are almost certainly of different lengths, and word positions presumably don't matter, cosine similarity is the better choice (note that, depending on your needs, better solutions may exist). :)

Here is a cosine similarity function of 2 arrays of words:

function cosineSimilarity($tokensA, $tokensB)
{
    $a = $b = $c = 0;
    $uniqueTokensA = $uniqueTokensB = array();

    $uniqueMergedTokens = array_unique(array_merge($tokensA, $tokensB));

    // Build membership maps so presence can be tested with isset().
    foreach ($tokensA as $token) $uniqueTokensA[$token] = 0;
    foreach ($tokensB as $token) $uniqueTokensB[$token] = 0;

    foreach ($uniqueMergedTokens as $token) {
        $x = isset($uniqueTokensA[$token]) ? 1 : 0; // token present in A?
        $y = isset($uniqueTokensB[$token]) ? 1 : 0; // token present in B?
        $a += $x * $y; // size of the intersection
        $b += $x;      // number of unique tokens in A
        $c += $y;      // number of unique tokens in B
    }
    return $b * $c != 0 ? $a / sqrt($b * $c) : 0;
}

It is fast: isset() is much faster than in_array() on large arrays.

As you can see, the result does not take into account the "magnitude" (frequency) of each word.

I use it to detect multi-posted messages of "almost" copy-pasted texts. It works well. :)
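For reference, here is a quick usage sketch of the function above; the tokenizer and sample sentences are made up for illustration:

```php
// The cosineSimilarity() function from above, repeated so this sketch is self-contained.
function cosineSimilarity($tokensA, $tokensB)
{
    $a = $b = $c = 0;
    $uniqueTokensA = $uniqueTokensB = array();
    foreach ($tokensA as $token) $uniqueTokensA[$token] = 0;
    foreach ($tokensB as $token) $uniqueTokensB[$token] = 0;
    foreach (array_unique(array_merge($tokensA, $tokensB)) as $token) {
        $x = isset($uniqueTokensA[$token]) ? 1 : 0;
        $y = isset($uniqueTokensB[$token]) ? 1 : 0;
        $a += $x * $y;
        $b += $x;
        $c += $y;
    }
    return $b * $c != 0 ? $a / sqrt($b * $c) : 0;
}

// A naive tokenizer for the example: lowercase, split on non-letter characters.
function tokenize($text) {
    return preg_split('/[^a-z]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}

$simA = cosineSimilarity(tokenize('The quick brown fox'),
                         tokenize('The quick red fox'));
echo $simA; // 3 shared tokens, 4 unique tokens each: 3 / sqrt(4 * 4) = 0.75
```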

The best link about string similarity metrics: http://www.dcs.shef.ac.uk/~sam/stringmetrics.html

For further interesting reading:

http://www.miislita.com/information-retrieval-tutorial/cosine-similarity-tutorial.html
http://bioinformatics.oxfordjournals.org/cgi/content/full/22/18/2298

Answered Oct 18 '22 by Toto


Unless I'm mistaken, I think you've got an algorithm halfway between the two algorithms. For Hamming distance, use:

function check ($terms1, $terms2) {
    $counts1 = array_count_values($terms1);
    $totalScore = 0;
    foreach ($terms2 as $term) {
        if (isset($counts1[$term])) $totalScore += 1;
    }
    return $totalScore * 500 / (count($terms1) * count($terms2));
}

(Note that you're only adding 1 for each matched element in the token vectors.)

And for cosine similarity, use:

function check ($terms1, $terms2) {
    $counts1 = array_count_values($terms1);
    $counts2 = array_count_values($terms2);
    $totalScore = 0;
    // Iterate over distinct terms so each product is added exactly once.
    foreach ($counts2 as $term => $count2) {
        if (isset($counts1[$term])) $totalScore += $counts1[$term] * $count2;
    }
    return $totalScore / (count($terms1) * count($terms2));
}

(Note that you're adding the product of the token counts between the two documents.)

The main difference between the two is that cosine similarity yields a stronger indicator when a word occurs multiple times in both documents, while Hamming distance doesn't care how often the individual tokens come up.
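Note that the snippet above normalizes by the raw document lengths, which is an approximation; a textbook cosine similarity divides the dot product by the product of the L2 (Euclidean) norms of the term-frequency vectors. A sketch of that version:

```php
// Sketch of a textbook cosine similarity over term-frequency vectors:
// dot product divided by the product of the vectors' L2 norms.
function cosineTf($terms1, $terms2) {
    $counts1 = array_count_values($terms1);
    $counts2 = array_count_values($terms2);

    $dot = 0;
    foreach ($counts2 as $term => $count2) {
        if (isset($counts1[$term])) $dot += $counts1[$term] * $count2;
    }

    // L2 norm of each frequency vector.
    $norm1 = sqrt(array_sum(array_map(function ($c) { return $c * $c; }, $counts1)));
    $norm2 = sqrt(array_sum(array_map(function ($c) { return $c * $c; }, $counts2)));

    return ($norm1 * $norm2) > 0 ? $dot / ($norm1 * $norm2) : 0;
}
```

With this normalization, identical documents always score exactly 1, regardless of their length.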

Edit: just noticed your query about removing function words etc. I do advise this if you're going to use cosine similarity - as function words are quite frequent (in English, at least), you might skew a result by not filtering them out. If you use Hamming distance, the effect will not be quite as great, but it could still be appreciable in some cases. Also, if you have access to a lemmatizer, it will reduce the misses when one document contains "galaxies" and the other contains "galaxy", for instance.
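To illustrate the stop-word filtering suggested above, here is a minimal sketch; the stop-word list is a tiny made-up sample, not a complete one:

```php
// Minimal stop-word filtering sketch; the list is a small illustrative sample.
function removeStopWords($tokens) {
    $stopWords = array('the', 'a', 'an', 'of', 'and', 'or', 'is', 'to', 'in');
    return array_values(array_filter($tokens, function ($t) use ($stopWords) {
        return !in_array(strtolower($t), $stopWords);
    }));
}

$filtered = removeStopWords(array('the', 'galaxy', 'is', 'far'));
// $filtered is array('galaxy', 'far')
```

Run this over both token arrays before computing the similarity, so frequent function words don't dominate the score.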

Whichever way you go, good luck!

Answered Oct 18 '22 by Mike