Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Cosine similarity when one of vectors is all zeros

How to express the cosine similarity ( http://en.wikipedia.org/wiki/Cosine_similarity )

when one of the vectors is all zeros?

v1 = [1, 1, 1, 1, 1]

v2 = [0, 0, 0, 0, 0]

When we calculate according to the classic formula we get division by zero:

Let d1 = 0 0 0 0 0 0
Let d2 = 1 1 1 1 1 1
Cosine Similarity (d1, d2) =  dot(d1, d2) / ||d1|| ||d2||dot(d1, d2) = (0)*(1) + (0)*(1) + (0)*(1) + (0)*(1) + (0)*(1) + (0)*(1) = 0

||d1|| = sqrt((0)^2 + (0)^2 + (0)^2 + (0)^2 + (0)^2 + (0)^2) = 0

||d2|| = sqrt((1)^2 + (1)^2 + (1)^2 + (1)^2 + (1)^2 + (1)^2) = 2.44948974278

Cosine Similarity (d1, d2) = 0 / (0) * (2.44948974278)
                           = 0 / 0

I want to use this similarity measure in a clustering application. And I often will need to compare such vectors. Also [0, 0, 0, 0, 0] vs. [0, 0, 0, 0, 0]

Do you have any experience? Since this is a similarity (not a distance) measure should I use special case for

d( [1, 1, 1, 1, 1]; [0, 0, 0, 0, 0] ) = 0

d([0, 0, 0, 0, 0]; [0, 0, 0, 0, 0] ) = 1

what about

d([1, 1, 1, 0, 0]; [0, 0, 0, 0, 0] ) = ? etc.

like image 203
Sebastian Widz Avatar asked Nov 02 '14 13:11

Sebastian Widz


2 Answers

It is undefined.

Think you have a vector C that is not zero in place your zero vector. Multiply it by epsilon > 0 and let run epsilon to zero. The result will depend on C, so the function is not continuous when one of the vectors is zero.

like image 62
Gyro Gearloose Avatar answered Oct 11 '22 23:10

Gyro Gearloose


If you have 0 vectors, cosine is the wrong similarity function for your application.

Cosine distance is essentially equivalent to squared Euclidean distance on L_2 normalized data. I.e. you normalize every vector to unit length 1, then compute squared Euclidean distance.

The other benefit of Cosine is performance - computing it on very sparse, high-dimensional data is faster than Euclidean distance. It benefits from sparsity to the square, not just linear.

While you obviously can try to hack the similarity to be 0 when exactly one is zero, and maximal when they are identical, it won't really solve the underlying problems.

Don't choose the distance by what you can easily compute.

Instead, choose the distance such that the result has a meaning on your data. If the value is undefined, you don't have a meaning...

Sometimes, it may work to discard constant-0 data as meaningless data anyway (e.g. analyzing Twitter noise, and seeing a Tweet that is all numbers, no words). Sometimes it doesn't.

like image 38
Has QUIT--Anony-Mousse Avatar answered Oct 11 '22 22:10

Has QUIT--Anony-Mousse