Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Spherical k-means implementation in Python

I've been using scipy's k-means for quite some time now, and I'm pretty happy about the way it works in terms of usability and efficiency. However, now I want to explore different k-means variants, more specifically, I'd like to apply spherical k-means in some of my problems.

Do you know any good Python implementation (i.e. similar to scipy's k-means) of spherical k-means? If not, how hard would it be to modify scipy's source code to adapt its k-means algorithm to be spherical?

Thank you.

like image 670
Oriol Nieto Avatar asked Oct 07 '13 14:10

Oriol Nieto


People also ask

How implement k-means algorithm in Python?

Step-1: Select the value of K, to decide the number of clusters to be formed. Step-2: Select random K points which will act as centroids. Step-3: Assign each data point, based on their distance from the randomly selected points (Centroid), to the nearest/closest centroid which will form the predefined clusters.

What is spherical k?

Spherical k-means is a widely used clustering algorithm for sparse and high-dimensional data such as document vectors.

Which algorithm is used for k-means implementation?

The K-means clustering algorithm computes centroids and repeats until the optimal centroid is found. It is presumptively known how many clusters there are. It is also known as the flat clustering algorithm. The number of clusters found from data by the method is denoted by the letter 'K' in K-means.

What is elbow Method k-means python?

The elbow method runs k-means clustering on the dataset for a range of values for k (say from 1-10) and then for each value of k computes an average score for all clusters. By default, the distortion score is computed, the sum of square distances from each point to its assigned center.


1 Answers

In spherical k-means, you aim to guarantee that the centers are on the sphere, so you could adjust the algorithm to use the cosine distance, and should additionally normalize the centroids of the final result.

When using the Euclidean distance, I prefer to think of the algorithm as projecting the cluster centers onto the unit sphere in each iteration, i.e., the centers should be normalized after each maximization step.

Indeed, when the centers and data points are both normalized, there is a 1-to-1 relationship between the cosine distance and Euclidean distance

|a - b|_2 = 2 * (1 - cos(a,b))

The package jasonlaska/spherecluster modifies scikit-learns's k-means into spherical k-means and also provides another sphere clustering algorithm.

like image 139
Jaska Avatar answered Oct 08 '22 20:10

Jaska