Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Clustering given pairwise distances with unknown cluster number?

I have a set of objects {obj1, obj2, obj3, ..., objn}. I have calculated the pairwise distances of all possible pairs. The distances are stored in a n*n matrix M, with Mij being the distance between obji and objj. Then it is natural to see M is a symmetric matrix.

Now I wish to perform unsupervised clustering to these objects. After some searching, I find Spectral Clustering may be a good candidate, since it deals with such pairwise-distance cases.

However, after carefully reading its description, I find it unsuitable in my case, as it requires the number of clusters as the input. Before clustering, I don't know the number of clusters. It has to be figured out by the algorithm while performing the clustering, like DBSCAN.

Considering these, please suggest me some clustering methods that fit my case, where

  1. The pairwise distances are all available.
  2. The number of clusters is unknown.
like image 774
Sibbs Gambling Avatar asked Sep 20 '13 04:09

Sibbs Gambling


4 Answers

You can try multidimensional scaling (MDS). After you use MDS to convert the distance-like data into a geometrical picture, you can apply common clustering methods (like k-means) for clustering. See here and here for more.

like image 53
chaohuang Avatar answered Sep 20 '22 19:09

chaohuang


There are many possible clustering methods, and none of them can be considered "best", everything depends on the data, as always:

  • If you would like to use spectral clustering, but do not know the number of clusters before hand I suggest taking a look at the self-tuning spectral clustering or some methods of determining the number of clusters
  • If you consider other algorithms you could try:
    • DBSCAN
    • OPTICS
    • Density-Link-Clustering
    • Hierarchical clustering
like image 45
lejlot Avatar answered Sep 18 '22 19:09

lejlot


It's easy to do with the metric='precomputed' argument in sklearn clustering algorithms. You fit the model with the pairwise distance matrix rather than original features.

The idea how to do this is the following (for the case when you need to create a pairwise distance matrix too):

def my_metric(x, y):
   # implement your distance measure between x and y

def create_pairwise_dist(X_data):
   # create a matrix of pairwised distances between all elements in your X_data
   # for example with sklearn.metrics.pairwise.pairwise_distances
   # or scipy.spatial.distance.pdist
   # or your own code

X_data = <prepare your data matrix of features>
X_dist = create_pairwise_dist(X_data)

# then you can use DBSCAN

dbscan = DBSCAN(eps=1.3, metric='precomputed')
dbscan.fit(X_dist)
like image 43
alperovich Avatar answered Sep 21 '22 19:09

alperovich


Clustering methods that require the number of clusters a priori are much more common than those that try to estimate the number of clusters. You might get better answers at Cross Validated. In the meantime, however, a couple of recent approaches to the problem are:

  • Estimating the number of clusters in a data set via the gap statistic by Tibshirani, Walther and Hastie, which compares the change in within-cluster dispersion with the number of clusters against the expected change for an appropriate reference null distribution. There is an R implementation of this approach.
  • Cluster Validation by Prediction Strength by Tibshirani and Walther, which views "clustering as a supervised classification problem, in which we must also estimate the 'true' class labels. The resulting 'prediction strength' measure assesses how many groups can be predicted from the data, and how well."
like image 22
Simon Avatar answered Sep 21 '22 19:09

Simon