
How to specify a distance function for clustering?


I'd like to cluster points according to a custom distance and, strangely, it seems that neither the scipy nor the sklearn clustering methods allow a distance function to be specified.

For instance, in sklearn.cluster.AgglomerativeClustering, the only thing I can do is pass in an affinity matrix (which will be very memory-heavy). To build this matrix, it is recommended to use sklearn.neighbors.kneighbors_graph, but I don't understand how to specify a distance function between two points there either. Could someone enlighten me?

Mark Morrisson asked Nov 15 '15 16:11



People also ask

Which distance function is used in k-means clustering?

The k-means clustering algorithm uses the Euclidean distance [1,4] to measure the similarities between objects.

What is the use of distance function in clustering?

Distance metrics are an important part of this kind of algorithm. In k-means, we select a number of centroids, which defines the number of clusters. Each data point is then assigned to its nearest centroid according to a distance metric (Euclidean).

How do you cluster a distance matrix?

Clustering starts by computing a distance between every pair of units that you want to cluster. A distance matrix will be symmetric (because the distance between x and y is the same as the distance between y and x) and will have zeroes on the diagonal (because every item is distance zero from itself).
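As a small illustration of those two properties, here is a sketch that builds a full distance matrix from a custom distance function with scipy.spatial.distance.pdist and squareform. The function name manhattan and the random data are assumptions made up for this example.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform


# Hypothetical custom distance: sum of absolute coordinate differences.
def manhattan(p1, p2):
    return np.abs(p1 - p2).sum()


rng = np.random.default_rng(0)
X = rng.standard_normal((5, 2))  # 5 illustrative points in 2D

# pdist returns the condensed form: one entry per unordered pair of points.
condensed = pdist(X, metric=manhattan)

# squareform expands it into the full symmetric n x n matrix.
D = squareform(condensed)

print(D.shape)
print(np.allclose(D, D.T))         # symmetry
print(np.allclose(np.diag(D), 0))  # zero diagonal
```

The condensed form stores only n*(n-1)/2 values instead of n*n, which helps somewhat when memory is a concern.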


1 Answer

All of the scipy hierarchical clustering routines will accept a custom distance function that accepts two 1D vectors specifying a pair of points and returns a scalar. For example, using fclusterdata:

    import numpy as np
    from scipy.cluster.hierarchy import fclusterdata

    # a custom function that just computes Euclidean distance
    def mydist(p1, p2):
        diff = p1 - p2
        return np.vdot(diff, diff) ** 0.5

    X = np.random.randn(100, 2)

    fclust1 = fclusterdata(X, 1.0, metric=mydist)
    fclust2 = fclusterdata(X, 1.0, metric='euclidean')

    print(np.allclose(fclust1, fclust2))
    # True

Valid inputs for the metric= kwarg are the same as for scipy.spatial.distance.pdist.
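Because the metric is handled the same way as in pdist, one possible variation is to call pdist yourself and hand the condensed distances to linkage. The sketch below assumes a made-up distance function (chebyshev_dist), two synthetic, well-separated blobs, and average linkage; none of these come from the question.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist


# Hypothetical custom distance: largest absolute coordinate difference.
def chebyshev_dist(p1, p2):
    return np.max(np.abs(p1 - p2))


rng = np.random.default_rng(0)
# Two illustrative blobs, centred near (0, 0) and (5, 5).
X = np.vstack([
    rng.normal(0.0, 0.1, size=(20, 2)),
    rng.normal(5.0, 0.1, size=(20, 2)),
])

# Condensed pairwise distances computed with the custom callable.
condensed = pdist(X, metric=chebyshev_dist)

# linkage accepts the condensed distances directly.
Z = linkage(condensed, method='average')

# Cut the tree into at most two flat clusters.
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)
```

Note that a Python callable is evaluated once per pair of points, so this can be much slower than a built-in metric name on large inputs.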

ali_m answered Sep 21 '22 09:09
