How to specify a distance function for clustering?

I'd like to cluster points according to a custom distance. Strangely, it seems that neither the scipy nor the sklearn clustering methods allow a distance function to be specified.

For instance, in sklearn.cluster.AgglomerativeClustering, the only thing I can do is pass an affinity matrix (which will be very memory-heavy). To build this matrix, sklearn.neighbors.kneighbors_graph is recommended, but I don't understand how to specify a distance function between two points there either. Could someone enlighten me?

asked Nov 15 '15 16:11

Mark Morrisson

1 Answer

All of the scipy hierarchical clustering routines accept a custom distance function that takes two 1D vectors specifying a pair of points and returns a scalar. For example, using fclusterdata:

    import numpy as np
    from scipy.cluster.hierarchy import fclusterdata

    # a custom function that just computes Euclidean distance
    def mydist(p1, p2):
        diff = p1 - p2
        return np.vdot(diff, diff) ** 0.5

    X = np.random.randn(100, 2)

    fclust1 = fclusterdata(X, 1.0, metric=mydist)
    fclust2 = fclusterdata(X, 1.0, metric='euclidean')

    print(np.allclose(fclust1, fclust2))
    # True

Valid inputs for the metric= kwarg are the same as for scipy.spatial.distance.pdist.
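
Because the metric= argument follows pdist, the pairwise distances can also be computed explicitly and handed to linkage. The following is only a sketch under a few assumptions: the manhattan function, the average linkage method, and the 1.5 cut-off are illustrative choices and not part of the original answer. pdist returns a condensed vector of n * (n - 1) / 2 distances, which takes roughly half the memory of a full square matrix.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist


def manhattan(p1, p2):
    # hypothetical custom metric: sum of absolute coordinate differences
    return np.abs(p1 - p2).sum()


rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))

# condensed pairwise distances, one entry per unordered pair of points
d_custom = pdist(X, metric=manhattan)
# the built-in string name for the same metric, usually much faster
d_builtin = pdist(X, metric="cityblock")

# linkage accepts the condensed distance vector directly
Z = linkage(d_custom, method="average")

# cut the dendrogram at an (arbitrary) distance threshold of 1.5
labels = fcluster(Z, t=1.5, criterion="distance")
```

A Python callable is invoked once per pair of points, so when a built-in metric name matches the distance you need, the string form is likely the better choice for larger datasets.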

answered Sep 21 '22 09:09

ali_m
