Clustering cosine similarity matrix

A few questions on stackoverflow mention this problem, but I haven't found a concrete solution.

I have a square matrix which consists of cosine similarities (values between 0 and 1), for example:

  |  A  |  B  |  C  |  D
A | 1.0 | 0.1 | 0.6 | 0.4
B | 0.1 | 1.0 | 0.1 | 0.2
C | 0.6 | 0.1 | 1.0 | 0.7
D | 0.4 | 0.2 | 0.7 | 1.0

The square matrix can be of any size. I want to get clusters (I don't know how many in advance) which maximize the similarity between the elements within each cluster. For the above example I should get two clusters:

  1. B
  2. A, C, D

The reason is that C & D have the highest similarity with each other (0.7), and A's highest similarity is also with C (0.6), while B is not strongly similar to anything.

An item can be in only one cluster.

Recall is not that important for this problem, but precision is very important. It is acceptable to output three clusters: 1) B, 2) A, 3) C, D. But it is not acceptable to output any solution where B is in a cluster with another element.

I think the diagonal (1.0) is confusing me. My data is guaranteed to have at least one cluster of 2+ elements, and I want to find as many clusters as possible without sacrificing precision.

I will have to implement this in Python.

asked May 06 '15 by Stefan D



1 Answer

You can do this easily with spectral clustering. You can use a ready-made implementation such as the one in sklearn, or implement it yourself; it is a fairly simple algorithm.

Here is a piece of code doing it in Python using sklearn:

import numpy as np
from sklearn.cluster import SpectralClustering

# the similarity matrix from the question
mat = np.array([[1., .1, .6, .4], [.1, 1., .1, .2], [.6, .1, 1., .7], [.4, .2, .7, 1.]])
# affinity='precomputed' tells sklearn to treat mat as a similarity matrix
# rather than as raw feature vectors
SpectralClustering(2, affinity='precomputed').fit_predict(mat)
>>> array([0, 1, 0, 0], dtype=int32)

As you can see, it returns the clustering you mentioned: B ends up in its own cluster and A, C, D together (depending on the sklearn version and random state the label numbers may be swapped, but the grouping is the same).
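If you want named groups rather than an array of labels, you can map the labels back to your item names. A minimal sketch, reusing the imports and mat from above; the names A-D and the clusters dict are just for illustration:

from collections import defaultdict

labels = SpectralClustering(2, affinity='precomputed').fit_predict(mat)
clusters = defaultdict(list)
for name, label in zip(['A', 'B', 'C', 'D'], labels):
    clusters[label].append(name)
print(dict(clusters))   # e.g. {0: ['A', 'C', 'D'], 1: ['B']}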

The algorithm takes the top k eigenvectors of the input matrix, i.e. the ones corresponding to the k largest eigenvalues, and then runs the k-means algorithm on the rows of that eigenvector matrix. Here is a simple piece of code that does this for your matrix:

from sklearn.cluster import KMeans

# np.linalg.eigh returns eigenvalues in ascending order, so the last two
# columns of eigen_vectors belong to the two largest eigenvalues
eigen_values, eigen_vectors = np.linalg.eigh(mat)
KMeans(n_clusters=2, init='k-means++').fit_predict(eigen_vectors[:, 2:4])
>>> array([0, 1, 0, 0], dtype=int32)

Note that the implementation of the algorithm in the sklearn library may differ from mine; the example I gave is the simplest way of doing it. There are good tutorials available online describing the spectral clustering algorithm in depth.
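Since you don't know the number of clusters up front, one common companion to spectral clustering is the eigengap heuristic: compute the eigenvalues of the normalized graph Laplacian in ascending order and pick k at the largest jump between consecutive eigenvalues. This is not part of the original answer, just a rough sketch reusing mat from above:

import numpy as np

# symmetric normalized Laplacian of the similarity graph: I - D^(-1/2) A D^(-1/2)
deg = mat.sum(axis=1)
lap = np.eye(len(mat)) - mat / np.sqrt(np.outer(deg, deg))

# eigenvalues come out in ascending order; the position of the largest gap
# between consecutive eigenvalues is a common hint for the number of clusters
lap_eigs = np.linalg.eigvalsh(lap)
k = int(np.argmax(np.diff(lap_eigs))) + 1   # should come out as k = 2 for this example

Treat the result only as a hint; for a small matrix like this the gaps are not very pronounced.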

For cases where you want the algorithm to figure out the number of clusters by itself, you can use a density-based clustering algorithm such as DBSCAN:

from sklearn.cluster import DBSCAN

# min_samples=1 makes every point a core point, so nothing is labelled as noise;
# with the default (Euclidean) metric the rows of mat are treated as feature vectors
DBSCAN(min_samples=1).fit_predict(mat)
>>> array([0, 1, 2, 2])
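Since your input is a similarity matrix rather than raw feature vectors, it may be more natural to give DBSCAN a precomputed distance matrix (this is a variation, not part of the original answer): convert similarity to distance with 1 - similarity and pass metric='precomputed'. The eps value below is only an example and would need tuning for your data:

from sklearn.cluster import DBSCAN

# distance = 1 - cosine similarity, so the diagonal becomes 0
dist = 1.0 - mat

# eps is the maximum distance at which two points count as neighbours;
# 0.5 is only an illustrative value for this particular matrix
labels = DBSCAN(eps=0.5, min_samples=1, metric='precomputed').fit_predict(dist)
print(labels)   # expected to put B on its own and A, C, D together
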
answered Oct 14 '22 by Ashkan