K-means algorithm variation with equal cluster size

I'm looking for the fastest algorithm for grouping points on a map into equally sized groups, by distance. The k-means clustering algorithm looks straightforward and promising, but does not produce equally sized groups.

Is there a variation of this algorithm or a different one that allows for an equal count of members for all clusters?

See also: Group n points in k clusters of equal size

asked Mar 27 '11 by pixelistik


3 Answers

The ELKI data mining framework has a tutorial on equal-size k-means.

This is not a particularly good algorithm, but it is an easy enough k-means variation to write a tutorial for and to teach people how to implement their own clustering algorithm variations; and apparently some people really do need their clusters to have the same size, even though the SSQ quality will be worse than with regular k-means.

In ELKI 0.7.5, you can select this algorithm as tutorial.clustering.SameSizeKMeansAlgorithm.
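
For a rough idea of how that kind of constraint is typically enforced, here is a minimal Python sketch of a greedy, capacity-limited assignment step in the same spirit; this is an illustration only, not the ELKI implementation, and the function name is made up:

import numpy as np

def capacity_constrained_assign(X, centroids, capacity):
    # Assumes capacity * len(centroids) >= len(X), otherwise some points stay unassigned.
    # Distance of every point to every centroid.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    # Points with the largest gap between their best and worst option are assigned first.
    priority = np.argsort(dists.min(axis=1) - dists.max(axis=1))
    labels = np.full(len(X), -1)
    remaining = np.full(len(centroids), capacity)
    for i in priority:
        for c in np.argsort(dists[i]):        # try the nearest centroid first
            if remaining[c] > 0:
                labels[i] = c
                remaining[c] -= 1
                break
    return labels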

answered Sep 16 '22 by Erich Schubert

This might do the trick: apply Lloyd's algorithm to get k centroids. Sort the centroids in an array by descending size of their associated clusters. For i = 1 through k-1, push the data points in cluster i with minimal distance to any other centroid j (i &lt; j ≤ k) off to j and recompute centroid i (but don't reassign the rest of the cluster), until cluster i has size n / k.

The complexity of this postprocessing step is O(k² n lg n).
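
A rough Python sketch of that post-processing step, assuming the initial labels and centers come from an ordinary k-means run (e.g. scikit-learn); the function name rebalance is just for illustration:

import numpy as np

def rebalance(X, labels, centers):
    labels, centers = labels.copy(), centers.copy()
    n, k = len(X), len(centers)
    target = n // k
    # Visit clusters from largest to smallest.
    order = np.argsort([-np.sum(labels == c) for c in range(k)])
    for idx, c in enumerate(order[:-1]):
        later = order[idx + 1:]                       # clusters that may still receive points
        while np.sum(labels == c) > target:
            members = np.where(labels == c)[0]
            # Distance of each member of cluster c to each "later" centroid.
            d = np.linalg.norm(X[members][:, None, :] - centers[later][None, :, :], axis=2)
            flat = np.argmin(d)
            point, dest = members[flat // len(later)], later[flat % len(later)]
            labels[point] = dest                      # push the cheapest point off to j
            centers[c] = X[labels == c].mean(axis=0)  # recompute centroid i only
    return labels

As in the description, only the shrinking cluster's centroid is recomputed; clusters later in the order absorb the pushed-out points and are trimmed in their own turn.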

answered Sep 20 '22 by Fred Foo

Just in case anyone wants to copy and paste a short function, here you go: basically, run KMeans, then find the minimal-cost matching of points to clusters under the constraint of a maximum number of points assigned to each cluster (the cluster size).

from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist
from scipy.optimize import linear_sum_assignment
import numpy as np


def get_even_clusters(X, cluster_size):
    n_clusters = int(np.ceil(len(X) / cluster_size))
    kmeans = KMeans(n_clusters)
    kmeans.fit(X)
    centers = kmeans.cluster_centers_
    # Duplicate each center cluster_size times so every cluster offers exactly
    # cluster_size "slots" in the assignment problem below.
    centers = centers.reshape(-1, 1, X.shape[-1]).repeat(cluster_size, 1).reshape(-1, X.shape[-1])
    # Cost matrix: distance from every point to every slot.
    distance_matrix = cdist(X, centers)
    # Hungarian algorithm: assign each point to one slot at minimal total cost,
    # then map slots back to their cluster index.
    clusters = linear_sum_assignment(distance_matrix)[1] // cluster_size
    return clusters
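
For example, a quick sanity check on random data (an illustrative usage, not part of the original answer):

X = np.random.rand(100, 2)                      # 100 random 2-D points
labels = get_even_clusters(X, cluster_size=10)
print(np.bincount(labels))                      # expect ten clusters of exactly 10 points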

answered Sep 18 '22 by Eyal Shulman