I have a large data set that I would like to cluster. My trial run set size is 2,500 objects; when I run it on the 'real deal' I will need to handle at least 20k objects.
These objects have a cosine similarity between them. This cosine similarity does not satisfy the requirements of being a mathematical distance metric; it doesn't satisfy the triangle inequality.
I would like to cluster them in some "natural" way that puts similar objects together without needing to specify beforehand the number of clusters I expect.
Does anyone know of an algorithm that would do that? Really, I'm just looking for any algorithm that doesn't require a) a distance metric and b) a pre-specified number of clusters.
Many thanks!
This question has been asked before here: Clustering from the cosine similarity values (but this solution only offers K-means clustering), and here: Effective clustering of a similarity matrix (but this solution was rather vague)
Apache mahout has a number of clustering algorithms, including some which don't require you to specify N and which allow you to specify the distance metric.
Mean shift clustering is similar to k-means but without a pre specified number of clusters https://cwiki.apache.org/confluence/display/MAHOUT/Mean+Shift+Clustering.
Then more generally, if you would like to try a variety of algorithms, there is an absolute wealth of sophisticated packages available for R (including a few variational Bayesian implementations of EM which will select the best number of clusters) which have proved very useful for some of my research in the past: http://cran.r-project.org/web/views/Cluster.html.
Actually most algorithms that require a "distance function" do not have the requirement for it to be metric.
DBSCAN can be generalized (see Wikipedia) to a version where it even abstracted from distance, it just needs to have some kind of "dense" notion. (DBSCAN also does not need to know the number of clusters beforehand)
But even for k-means - which has rather strict requirements on the distance, even beyond metrical - there is a variant called spherical k-means.
Anyway, in a database context, the full requirements of "metric" are utopic. In any real world data, there may be two records with the same coordinates, so at most you would have a pseudo-metric. The triangle inequality mostly plays a role for optimization (e.g. by using an M-tree index, which has strict triangle inequality requirements) or accelerated k-means exploiting this property.
You could also try Affinity Propagation (http://www.psi.toronto.edu/index.php?q=affinity%20propagation). The algorithm takes a similarity matrix as input, and can also, I believe, automatically adjust the number of cluster centroids.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With