
Clustering words based on Distance Matrix

My objective is to cluster words based on how similar they are with respect to a corpus of text documents. I have computed the Jaccard similarity between every pair of words; in other words, I have a sparse distance matrix available. Can anyone point me to a clustering algorithm (and, if possible, a Python library that implements it) which takes a distance matrix as input? I also do not know the number of clusters beforehand. I only want to cluster these words and find out which words are clustered together.

user2115183 asked Apr 26 '13 22:04


2 Answers

You can use most algorithms in scikit-learn with a precomputed distance matrix. Unfortunately, many of these algorithms need the number of clusters up front. DBSCAN is the only one that doesn't need the number of clusters and also accepts arbitrary distance matrices. You could also try MeanShift, but that will interpret the distances as coordinates, which might also work.
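As a rough illustration of the DBSCAN route, here is a minimal sketch. The word list and similarity values are made-up toy data, and `eps` and `min_samples` are assumptions you would need to tune for a real corpus. The Jaccard similarity is turned into a distance as `1 - similarity` before being passed in with `metric="precomputed"`.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical toy data: a symmetric Jaccard similarity matrix for five words.
words = ["cat", "dog", "puppy", "car", "truck"]
similarity = np.array([
    [1.0, 0.8, 0.7, 0.1, 0.0],
    [0.8, 1.0, 0.9, 0.0, 0.1],
    [0.7, 0.9, 1.0, 0.1, 0.0],
    [0.1, 0.0, 0.1, 1.0, 0.8],
    [0.0, 0.1, 0.0, 0.8, 1.0],
])

# DBSCAN expects distances, so convert similarity to Jaccard distance.
distance = 1.0 - similarity

# eps and min_samples are illustrative guesses, not recommended defaults.
model = DBSCAN(eps=0.4, min_samples=2, metric="precomputed")
labels = model.fit_predict(distance)

# Group words by cluster label; a label of -1 would mean "noise".
clusters = {}
for word, label in zip(words, labels):
    clusters.setdefault(int(label), []).append(word)

for label, members in sorted(clusters.items()):
    print(label, members)
```

Words that DBSCAN cannot attach to any dense group get the label `-1`, so with real data it is worth checking how many words end up there before trusting the result.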

There is also affinity propagation, but I haven't really seen that working well. If you want many clusters, that might be helpful, though.
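If you want to try affinity propagation anyway, a minimal sketch could look like the following. Note that with `affinity="precomputed"` scikit-learn expects a similarity matrix rather than a distance matrix, so the Jaccard similarities can be passed in directly. The data is the same kind of made-up toy matrix, and `random_state=0` is only there to make runs repeatable.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

# Hypothetical toy data: a symmetric Jaccard similarity matrix for five words.
words = ["cat", "dog", "puppy", "car", "truck"]
similarity = np.array([
    [1.0, 0.8, 0.7, 0.1, 0.0],
    [0.8, 1.0, 0.9, 0.0, 0.1],
    [0.7, 0.9, 1.0, 0.1, 0.0],
    [0.1, 0.0, 0.1, 1.0, 0.8],
    [0.0, 0.1, 0.0, 0.8, 1.0],
])

# affinity="precomputed" means the input is already a similarity matrix.
model = AffinityPropagation(affinity="precomputed", random_state=0)
labels = model.fit_predict(similarity)

# Print each word next to the cluster label it was assigned.
for word, label in zip(words, labels):
    print(word, int(label))
```

The number of clusters is driven by the `preference` parameter (by default the median of the input similarities), so that is the first thing to adjust if you get far too many or too few groups.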

disclosure: I'm a scikit-learn core dev.

Andreas Mueller answered Oct 16 '22 08:10


The scipy clustering package (scipy.cluster) could be useful. There are hierarchical clustering functions in scipy.cluster.hierarchy. Note, however, that those require a condensed distance matrix as input (the upper triangle of the distance matrix, flattened into a 1-D array). Hopefully the documentation pages will help you along.
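A minimal sketch of that workflow, assuming you start from a full square similarity matrix: `squareform` from scipy.spatial.distance converts the square distance matrix to the condensed form, `linkage` builds the hierarchy, and `fcluster` cuts it at a distance threshold so you do not have to pick the number of clusters. The toy data, the `"average"` linkage method, and the threshold `t=0.5` are all assumptions for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Hypothetical toy data: a symmetric Jaccard similarity matrix for five words.
words = ["cat", "dog", "puppy", "car", "truck"]
similarity = np.array([
    [1.0, 0.8, 0.7, 0.1, 0.0],
    [0.8, 1.0, 0.9, 0.0, 0.1],
    [0.7, 0.9, 1.0, 0.1, 0.0],
    [0.1, 0.0, 0.1, 1.0, 0.8],
    [0.0, 0.1, 0.0, 0.8, 1.0],
])

# Convert to distances and force an exact zero diagonal,
# which squareform requires for a valid distance matrix.
distance = 1.0 - similarity
np.fill_diagonal(distance, 0.0)

# Condensed form: the upper triangle flattened into a 1-D array.
condensed = squareform(distance)

# Build the hierarchy, then cut it wherever merges exceed the threshold.
Z = linkage(condensed, method="average")
labels = fcluster(Z, t=0.5, criterion="distance")

for word, label in zip(words, labels):
    print(word, int(label))
```

Plotting the hierarchy with `scipy.cluster.hierarchy.dendrogram(Z)` is a reasonable way to choose the threshold by eye.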

Bastiaan van den Berg answered Oct 16 '22 08:10