I've been using scipy's k-means for quite some time now, and I'm pretty happy about the way it works in terms of usability and efficiency. However, now I want to explore different k-means variants, more specifically, I'd like to apply spherical k-means in some of my problems.
Do you know any good Python implementation (i.e. similar to scipy's k-means) of spherical k-means? If not, how hard would it be to modify scipy's source code to adapt its k-means algorithm to be spherical?
Thank you.
Step-1: Select the value of K, to decide the number of clusters to be formed. Step-2: Select random K points which will act as centroids. Step-3: Assign each data point, based on their distance from the randomly selected points (Centroid), to the nearest/closest centroid which will form the predefined clusters.
Spherical k-means is a widely used clustering algorithm for sparse and high-dimensional data such as document vectors.
The K-means clustering algorithm computes centroids and repeats until the optimal centroid is found. It is presumptively known how many clusters there are. It is also known as the flat clustering algorithm. The number of clusters found from data by the method is denoted by the letter 'K' in K-means.
The elbow method runs k-means clustering on the dataset for a range of values for k (say from 1-10) and then for each value of k computes an average score for all clusters. By default, the distortion score is computed, the sum of square distances from each point to its assigned center.
In spherical k-means, you aim to guarantee that the centers are on the sphere, so you could adjust the algorithm to use the cosine distance, and should additionally normalize the centroids of the final result.
When using the Euclidean distance, I prefer to think of the algorithm as projecting the cluster centers onto the unit sphere in each iteration, i.e., the centers should be normalized after each maximization step.
Indeed, when the centers and data points are both normalized, there is a 1-to-1 relationship between the cosine distance and Euclidean distance
|a - b|_2 = 2 * (1 - cos(a,b))
The package jasonlaska/spherecluster modifies scikit-learns's k-means
into spherical k-means
and also provides another sphere clustering algorithm.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With