I am looking to use the k-means algorithm to cluster some data, but I would like to use a custom distance function. Is there any way I can change the distance function that is used by scikit-learn?
I would also settle for a different framework or module that allows swapping the distance function and can compute k-means in parallel (I would like to speed up the calculation, and parallelism is a nice feature of scikit-learn).
Any suggestions?
Sklearn Kmeans uses the Euclidean distance. It has no metric parameter.
K-Means inertia: Inertia measures how well a dataset was clustered by K-Means. It is calculated by measuring the distance between each data point and its assigned centroid, squaring this distance, and summing these squares over all points. A good model is one with low inertia and a low number of clusters (K); the two have to be traded off, because inertia keeps decreasing as K grows.
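As a rough illustration of the inertia calculation described above, here is a minimal NumPy sketch. The function name `inertia` and the toy data are made up for this example; it assumes Euclidean distance and that every point already has a cluster label.

```python
import numpy as np

def inertia(X, labels, centroids):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    X = np.asarray(X, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    # Pick the centroid assigned to each point, then square and sum the offsets.
    diffs = X - centroids[np.asarray(labels)]
    return float(np.sum(diffs ** 2))

# Toy data (illustrative values only): two obvious groups on a line.
X = np.array([[0.0], [2.0], [10.0], [12.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[1.0], [11.0]])

total = inertia(X, labels, centroids)
print(total)
```

Each point sits exactly one unit from its centroid here, so the four squared distances add up to 4.0.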
If the Manhattan distance is used in k-means clustering instead, the center that minimizes the within-cluster distance is the median value for each dimension, rather than the mean value for each dimension as with Euclidean distance. This variant is usually called k-medians.
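To make the Manhattan case concrete, here is a small k-medians sketch in NumPy. It is not part of scikit-learn; the function names `kmedians` and `_assign`, the parameters, and the toy data are assumptions chosen for illustration. Points are assigned by L1 distance and each center is updated to the per-dimension median of its members.

```python
import numpy as np

def _assign(X, centers):
    # Manhattan (L1) distance from every point to every center, shape (n, k).
    dists = np.abs(X[:, None, :] - centers[None, :, :]).sum(axis=2)
    return dists.argmin(axis=1)

def kmedians(X, k, n_iter=100, seed=0):
    """Lloyd-style loop with L1 assignment and median updates (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Start from k distinct data points chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = _assign(X, centers)
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old center if a cluster is empty
                new_centers[j] = np.median(members, axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return _assign(X, centers), centers

# Toy data (illustrative values only): two well-separated groups.
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
labels, centers = kmedians(X, k=2)
print(labels, centers)
```

Swapping the distance alone is not enough: the update step has to match it, which is why the median replaces the mean here. For an arbitrary distance there may be no closed-form center at all.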
tot.withinss: Total within-cluster sum of squares, i.e. sum(withinss). betweenss: The between-cluster sum of squares, i.e. totss - tot.withinss. size: The number of points in each cluster.
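You could try the spectral clustering algorithm, which lets you supply your own precomputed matrix, built from whatever distance you like. Note that scikit-learn expects an affinity (similarity) matrix rather than raw distances, so the distances have to be converted first.
On convex clusters it performs about as well as K-means, and it also handles non-convex problems because it detects connectivity. See more here.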
The good news is that spectral clustering is also implemented in scikit-learn.
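A minimal sketch of that approach, assuming a hypothetical custom distance `my_distance` (Manhattan is used only as a stand-in) and a Gaussian kernel to turn distances into affinities. The kernel and the bandwidth choice are assumptions for illustration, not something scikit-learn prescribes.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics import pairwise_distances

def my_distance(a, b):
    # Hypothetical custom distance; replace with your own function.
    return np.abs(a - b).sum()

# Toy data (illustrative values only): two well-separated groups.
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)

# Pairwise distance matrix computed with the custom function.
D = pairwise_distances(X, metric=my_distance)

# Convert distances to similarities with a Gaussian kernel.
# Using the standard deviation of D as bandwidth is a heuristic assumption.
sigma = D.std()
A = np.exp(-(D ** 2) / (2 * sigma ** 2))

model = SpectralClustering(n_clusters=2, affinity="precomputed", random_state=0)
labels = model.fit_predict(A)
print(labels)
```

`pairwise_distances` also accepts an `n_jobs` argument, which may help with the parallelism asked about in the question, although a Python callable as the metric is still much slower than the built-in metrics.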
Hope it helps.