Clustering cosine similarity matrix

A few questions on stackoverflow mention this problem, but I haven't found a concrete solution.

I have a square matrix which consists of cosine similarities (values between 0 and 1), for example:

  |  A  |  B  |  C  |  D
A | 1.0 | 0.1 | 0.6 | 0.4
B | 0.1 | 1.0 | 0.1 | 0.2
C | 0.6 | 0.1 | 1.0 | 0.7
D | 0.4 | 0.2 | 0.7 | 1.0

The square matrix can be of any size. I want to get clusters (I don't know how many in advance) which maximize the similarity between the elements within each cluster. For the above example I should get two clusters:

  1. B
  2. A, C, D

The reason is that C & D have the highest similarity with each other (0.7), and A's highest similarity is also with C (0.6), while B is not strongly similar to anything.

An item can be in only one cluster.

Recall is not that important for this problem, but precision is very important. It is acceptable to output three clusters: 1) B, 2) A, 3) C, D. But it is not acceptable to output any solution where B is in a cluster with another element.

I think the diagonal (1.0) is confusing me. My data is guaranteed to have at least one cluster of 2+ elements, and I want to find as many clusters as possible without sacrificing precision.

I will have to implement this in Python.

asked May 06 '15 by Stefan D



1 Answer

You can do this easily with spectral clustering. You can use a ready-made implementation such as the one in sklearn, or implement it yourself; it is a fairly simple algorithm.

Here is a piece of code doing it in Python using sklearn:

import numpy as np
from sklearn.cluster import SpectralClustering

# the similarity matrix from the question
mat = np.array([[1., .1, .6, .4], [.1, 1., .1, .2], [.6, .1, 1., .7], [.4, .2, .7, 1.]])
# affinity='precomputed' tells sklearn to treat mat as a similarity matrix
# rather than as raw feature vectors
SpectralClustering(2, affinity='precomputed').fit_predict(mat)
>>> array([0, 1, 0, 0], dtype=int32)

As you can see, it returns the clustering you mentioned: B ends up in its own cluster and A, C, D together (depending on the sklearn version and random state the label numbers may be swapped, but the grouping is the same).
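If you want named groups rather than an array of labels, you can map the labels back to your item names. A minimal sketch, reusing the imports and mat from above; the names A-D and the clusters dict are just for illustration:

from collections import defaultdict

labels = SpectralClustering(2, affinity='precomputed').fit_predict(mat)
clusters = defaultdict(list)
for name, label in zip(['A', 'B', 'C', 'D'], labels):
    clusters[label].append(name)
print(dict(clusters))   # e.g. {0: ['A', 'C', 'D'], 1: ['B']}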

The algorithm takes the top k eigenvectors of the input matrix, i.e. the ones corresponding to the k largest eigenvalues, and then runs the k-means algorithm on the rows of that eigenvector matrix. Here is a simple piece of code that does this for your matrix:

from sklearn.cluster import KMeans

# np.linalg.eigh returns eigenvalues in ascending order, so the last two
# columns of eigen_vectors belong to the two largest eigenvalues
eigen_values, eigen_vectors = np.linalg.eigh(mat)
KMeans(n_clusters=2, init='k-means++').fit_predict(eigen_vectors[:, 2:4])
>>> array([0, 1, 0, 0], dtype=int32)

Note that the implementation of the algorithm in the sklearn library may differ from mine; the example I gave is the simplest way of doing it. There are good tutorials available online describing the spectral clustering algorithm in depth.
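Since you don't know the number of clusters up front, one common companion to spectral clustering is the eigengap heuristic: compute the eigenvalues of the normalized graph Laplacian in ascending order and pick k at the largest jump between consecutive eigenvalues. This is not part of the original answer, just a rough sketch reusing mat from above:

import numpy as np

# symmetric normalized Laplacian of the similarity graph: I - D^(-1/2) A D^(-1/2)
deg = mat.sum(axis=1)
lap = np.eye(len(mat)) - mat / np.sqrt(np.outer(deg, deg))

# eigenvalues come out in ascending order; the position of the largest gap
# between consecutive eigenvalues is a common hint for the number of clusters
lap_eigs = np.linalg.eigvalsh(lap)
k = int(np.argmax(np.diff(lap_eigs))) + 1   # should come out as k = 2 for this example

Treat the result only as a hint; for a small matrix like this the gaps are not very pronounced.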

For cases where you want the algorithm to figure out the number of clusters by itself, you can use a density-based clustering algorithm such as DBSCAN:

from sklearn.cluster import DBSCAN

# min_samples=1 makes every point a core point, so nothing is labelled as noise;
# with the default (Euclidean) metric the rows of mat are treated as feature vectors
DBSCAN(min_samples=1).fit_predict(mat)
>>> array([0, 1, 2, 2])
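Since your input is a similarity matrix rather than raw feature vectors, it may be more natural to give DBSCAN a precomputed distance matrix (this is a variation, not part of the original answer): convert similarity to distance with 1 - similarity and pass metric='precomputed'. The eps value below is only an example and would need tuning for your data:

from sklearn.cluster import DBSCAN

# distance = 1 - cosine similarity, so the diagonal becomes 0
dist = 1.0 - mat

# eps is the maximum distance at which two points count as neighbours;
# 0.5 is only an illustrative value for this particular matrix
labels = DBSCAN(eps=0.5, min_samples=1, metric='precomputed').fit_predict(dist)
print(labels)   # expected to put B on its own and A, C, D together
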
answered Oct 14 '22 by Ashkan