I'd like to cluster points given to a custom distance and strangely, it seems that neither scipy nor sklearn clustering methods allow the specification of a distance function.
For instance, in sklearn.cluster.AgglomerativeClustering
, the only thing I may do is enter an affinity matrix (which will be very memory-heavy). In order to build this very matrix, it is recommended to use sklearn.neighbors.kneighbors_graph
, but I don't understand how I can specify a distance function either between two points. Could someone enlighten me?
The k-means clustering algorithm uses the Euclidean distance [1,4] to measure the similarities between objects.
Clustering Distance metrics are important part of these kind of algorithm. In K-means, we select number of centroids that define number of clusters. Each data point will then be assigned to its nearest centroid using distance metric (Euclidean).
Clustering starts by computing a distance between every pair of units that you want to cluster. A distance matrix will be symmetric (because the distance between x and y is the same as the distance between y and x) and will have zeroes on the diagonal (because every item is distance zero from itself).
All of the scipy hierarchical clustering routines will accept a custom distance function that accepts two 1D vectors specifying a pair of points and returns a scalar. For example, using fclusterdata
:
import numpy as np from scipy.cluster.hierarchy import fclusterdata # a custom function that just computes Euclidean distance def mydist(p1, p2): diff = p1 - p2 return np.vdot(diff, diff) ** 0.5 X = np.random.randn(100, 2) fclust1 = fclusterdata(X, 1.0, metric=mydist) fclust2 = fclusterdata(X, 1.0, metric='euclidean') print(np.allclose(fclust1, fclust2)) # True
Valid inputs for the metric=
kwarg are the same as for scipy.spatial.distance.pdist
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With