Cosine distance as vector distance function for k-means

Tags:

I have a graph of N vertices where each vertex represents a place. Also I have vectors, one per user, each one of N coefficients where the coefficient's value is the duration in seconds spent at the corresponding place or 0 if that place was not visited.

E.g. for the graph:

Sample graph

the vector:

v1 = {100, 50, 0 30, 0}

would mean that we spent:

100secs at vertex 1
50secs at vertex 2 and 
30secs at vertex 4

(vertices 3 & 5 where not visited, thus the 0s).

I want to run a k-means clustering and I've chosen cosine_distance = 1 - cosine_similarity as the metric for the distances, where the formula for cosine_similarity is:

cosine simularity formula

as described here.

But I noticed the following. Assume k=2 and one of the vectors is:

v1 = {90,0,0,0,0}

In the process of solving the optimization problem of minimizing the total distance from candidate centroids, assume that at some point, 2 candidate centroids are:

c1 = {90,90,90,90,90}
c2 = {1000, 1000, 1000, 1000, 1000}

Running the cosine_distance formula for (v1, c1) and (v1, c2) we get exactly the same distance of 0.5527864045 for both.

I would assume that v1 is more similar (closer) to c1 than c2. Apparently this is not the case.

Q1. Why is this assumption wrong?

Q2. Is the cosine distance a correct distance function for this case?

Q3. What would be a better one given the nature of the problem?

622

asked Aug 07 '14 11:08

Thalis K.

1 Answers

Let's divide cosine similarity into parts and see how and why it works.

Cosine between 2 vectors - a and b - is defined as:

cos(a, b) = sum(a .* b) / (length(a) * length(b))

where .* is an element-wise multiplication. Denominator is here just for normalization, so let's simply call it L. With it our functions turns into:

cos(a, b) = sum(a .* b) / L

which, in its turn, may be rewritten as:

cos(a, b) = (a[1]*b[1] + a[2]*b[2] + ... + a[k]*b[k]) / L = 
          = a[1]*b[1]/L + a[2]*b[2]/L + ... + a[k]*b[k]/L

Let's get a bit more abstract and replace x * y / L with function g(x, y) (L here is constant, so we don't put it as function argument). Our cosine function thus becomes:

cos(a, b) = g(a[1], b[1]) + g(a[2], b[2]) + ... + g(a[n], b[n])

That is, each pair of elements (a[i], b[i]) is treated separately, and result is simply sum of all treatments. And this is good for your case, because you don't want different pairs (different vertices) to mess with each other: if user1 visited only vertex2 and user2 - only vertex1, then they have nothing in common, and similarity between them should be zero. What you actually don't like is how similarity between individual pairs - i.e. function g() - is calculated.

With cosine function similarity between individual pairs looks like this:

g(x, y) = x * y / L

where x and y represent time users spent on the vertex. And here's the main question: does multiplication represent similarity between individual pairs well? I don't think so. User who spent 90 seconds on some vertex should be close to user who spent there, say, 70 or 110 seconds, but much more far from users who spend there 1000 or 0 seconds. Multiplication (even normalized by L) is totally misleading here. What it even means to multiply 2 time periods?

Good news is that this is you who design similarity function. We have already decided that we are satisfied with independent treatment of pairs (vertices), and we only want individual similarity function g(x, y) to make something reasonable with its arguments. And what is reasonable function to compare time periods? I'd say subtraction is a good candidate:

g(x, y) = abs(x - y)

This is not similarity function, but instead distance function - the closer are values to each other, the smaller is result of g() - but eventually idea is the same, so we can interchange them when we need.

We may also want to increase impact of large mismatches by squaring the difference:

g(x, y) = (x - y)^2

Hey! We've just reinvented (mean) squared error! We can now stick to MSE to calculate distance, or we can proceed finding good g() function.

Sometimes we may want not increase, but instead smooth the difference. In this case we can use log:

g(x, y) = log(abs(x - y))

We can use special treatment for zeros like this:

g(x, y) = sign(x)*sign(y)*abs(x - y)   # sign(0) will turn whole expression to 0

Or we can go back from distance to similarity by inversing the difference:

g(x, y) = 1 / abs(x - y)

Note, that in recent options we haven't used normalization factor. In fact, you can come up with some good normalization for each case, or just omit it - normalization is not always needed or good. For example, in cosine similarity formula if you change normalization constant L=length(a) * length(b) to L=1, you will get different, but still reasonable results. E.g.

cos([90, 90, 90]) == cos(1000, 1000, 1000)  # measuring angle only
cos_no_norm([90, 90, 90]) < cos_no_norm([1000, 1000, 1000])  # measuring both - angle and magnitude

Summarizing this long and mostly boring story, I would suggest rewriting cosine similarity/distance to use some kind of difference between variables in two vectors.

answered Nov 12 '22 20:11

ffriend

Related questions
                            
                                Global Dynamic Supervisor in a cluster
                            
                                How to find cluster sizes in 2D numpy array?
                            
                                How does pytorch backprop through argmax?
                            
                                Is a Fuzzy C-Means algorithm available for Python?
                            
                                DBSCAN on spark : which implementation
                            
                                How do I predict new data's cluster after clustering training data?
                            
                                clustering very large dataset in R
                            
                                How do I create a radial cluster like the following code-example in Python?
                            
                                How to create a cluster plot in R?
                            
                                Assign new data point to cluster in kernel k-means (kernlab package in R)?
                            
                                TypeError: ufunc 'true_divide' output (typecode 'd') could not be coerced to provided output parameter (typecode 'q')
                            
                                Newman's modularity clustering for graphs
                            
                                How to find the success rate of a clustering algorithm?
                            
                                Clustering with a distance matrix
                            
                                clustering with NA values in R
                            
                                Where to find a reliable K-medoid(Not k-means) open source software/tool? [closed]
                            
                                Python Clustering Algorithms
                            
                                News clustering
                            
                                Hierarchical Clustering: Determine optimal number of cluster and statistically describe Clusters
                            
                                Graph Theory: Calculating Clustering Coefficient

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Cosine distance as vector distance function for k-means

Tags:

distance

cluster-analysis

k-means

data-mining

cosine-similarity

Thalis K.

People also ask

1 Answers

ffriend

Recent Activity

Donate For Us