To compute the similarity between two documents, I create a feature vector containing the term frequencies. But then, for the next step, I can't decide between "Cosine similarity" and "Hamming distance".
My question: Do you have experience with these algorithms? Which one gives you better results?
In addition to that: Could you tell me how to code the Cosine similarity in PHP? For Hamming distance, I've already got the code:
function check ($terms1, $terms2) {
$counts1 = array_count_values($terms1);
$totalScore = 0;
foreach ($terms2 as $term) {
if (isset($counts1[$term])) $totalScore += $counts1[$term];
}
return $totalScore * 500 / (count($terms1) * count($terms2));
}
I don't want to use any other algorithm. I would only like to have help to decide between both.
And maybe someone can say something to how to improve the algorithms. Will you get better results if you filter out the stop words or common words?
I hope you can help me. Thanks in advance!
The cosine similarity is advantageous because even if the two similar documents are far apart by the Euclidean distance (due to the size of the document), chances are they may still be oriented closer together. The smaller the angle, higher the cosine similarity.
Cosine Distance = 1 — Cosine Similarity The intuition behind this is that if 2 vectors are perfectly the same then the similarity is 1 (angle=0 hence 𝑐𝑜𝑠(𝜃)=1) and thus, distance is 0 (1–1=0).
The Euclidean distance corresponds to the L2-norm of a difference between vectors. The cosine similarity is proportional to the dot product of two vectors and inversely proportional to the product of their magnitudes.
Why do we use cosine similarity in NLP? In NLP, Cosine similarity is a metric used to measure how similar the documents are irrespective of their size. Mathematically, it calculates the cosine of the angle between two vectors projected in a multi-dimensional space.
A Hamming distance should be done between two strings of equal length and with the order taken into account.
As your documents are certainly of different length and if the words places do not count, cosine similarity is better (please note that depending your needs, better solutions exist). :)
Here is a cosine similarity function of 2 arrays of words:
function cosineSimilarity($tokensA, $tokensB)
{
$a = $b = $c = 0;
$uniqueTokensA = $uniqueTokensB = array();
$uniqueMergedTokens = array_unique(array_merge($tokensA, $tokensB));
foreach ($tokensA as $token) $uniqueTokensA[$token] = 0;
foreach ($tokensB as $token) $uniqueTokensB[$token] = 0;
foreach ($uniqueMergedTokens as $token) {
$x = isset($uniqueTokensA[$token]) ? 1 : 0;
$y = isset($uniqueTokensB[$token]) ? 1 : 0;
$a += $x * $y;
$b += $x;
$c += $y;
}
return $b * $c != 0 ? $a / sqrt($b * $c) : 0;
}
It is fast (isset()
instead of in_array()
is a killer on large arrays).
As you can see, the results does not take into account the "magnitude" of each the word.
I use it to detect multi-posted messages of "almost" copy-pasted texts. It works well. :)
The best link about string similarity metrics: http://www.dcs.shef.ac.uk/~sam/stringmetrics.html
For further interesting readings:
http://www.miislita.com/information-retrieval-tutorial/cosine-similarity-tutorial.html http://bioinformatics.oxfordjournals.org/cgi/content/full/22/18/2298
Unless I'm mistaken, I think you've got an algorithm halfway between the two algorithms. For Hamming distance, use:
function check ($terms1, $terms2) {
$counts1 = array_count_values($terms1);
$totalScore = 0;
foreach ($terms2 as $term) {
if (isset($counts1[$term])) $totalScore += 1;
}
return $totalScore * 500 / (count($terms1) * count($terms2));
}
(Note that you're only adding 1 for each matched element in the token vectors.)
And for cosine similarity, use:
function check ($terms1, $terms2) {
$counts1 = array_count_values($terms1);
$counts2 = array_count_values($terms2);
$totalScore = 0;
foreach ($terms2 as $term) {
if (isset($counts1[$term])) $totalScore += $counts1[$term] * $counts2[$term];
}
return $totalScore / (count($terms1) * count($terms2));
}
(Note that you're adding the product of the token counts between the two documents.)
The main difference between the two is that cosine similarity will yield a stronger indicator when two documents have the same word multiple times in the documents, while Hamming distance doesn't care how often the individual tokens come up.
Edit: just noticed your query about removing function words etc. I do advise this if you're going to use cosine similarity - as function words are quite frequent (in English, at least), you might skew a result by not filtering them out. If you use Hamming distance, the effect will not be quite as great, but it could still be appreciable in some cases. Also, if you have access to a lemmatizer, it will reduce the misses when one document contains "galaxies" and the other contains "galaxy", for instance.
Whichever way you go, good luck!
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With