I'm looking for the fastest algorithm for grouping points on a map into equally sized groups, by distance. The k-means clustering algorithm looks straightforward and promising, but does not produce equally sized groups.
Is there a variation of this algorithm or a different one that allows for an equal count of members for all clusters?
See also: Group n points in k clusters of equal size
The k-means clustering algorithm looks straightforward and promising, but does not produce equally sized groups.
The Silhouette Method The optimal number of clusters k is the one that maximize the average silhouette over a range of possible values for k. This also suggests an optimal of 2 clusters.
K-means clustering algorithm computes the centroids and iterates until we it finds optimal centroid. It assumes that the number of clusters are already known. It is also called flat clustering algorithm. The number of clusters identified from data by algorithm is represented by 'K' in K-means.
In plain English, the cluster variance is the coordinate-wise squared deviations from the mean of the cluster of all the observations belonging to that cluster. The total within cluster scatter (for the entire set of observations) is simply W=K∑k=1∑xi∈Ck‖xi−ˉxk‖2 for K clusters and N observations with K<N.
The ELKI data mining framework has a tutorial on equal-size k-means.
This is not a particulary good algorithm, but it's an easy enough k-means variation to write a tutorial for and teach people how to implement their own clustering algorithm variation; and apparently some people really need their clusters to have the same size, although the SSQ quality will be worse than with regular k-means.
In ELKI 0.7.5, you can select this algorithm as tutorial.clustering.SameSizeKMeansAlgorithm
.
This might do the trick: apply Lloyd's algorithm to get k centroids. Sort the centroids by descending size of their associated clusters in an array. For i = 1 through k-1, push the data points in cluster i with minimal distance to any other centroid j (i < j ≤ k) off to j and recompute the centroid i (but don't recompute the cluster) until the cluster size is n / k.
The complexity of this postprocessing step is O(k² n lg n).
Just in case anyone wants to copy and paste a short function here you go - basically running KMeans then finding the minimal matching of points to clusters under the constraint of maximal points assigned to cluster (cluster size)
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist
from scipy.optimize import linear_sum_assignment
import numpy as np
def get_even_clusters(X, cluster_size):
n_clusters = int(np.ceil(len(X)/cluster_size))
kmeans = KMeans(n_clusters)
kmeans.fit(X)
centers = kmeans.cluster_centers_
centers = centers.reshape(-1, 1, X.shape[-1]).repeat(cluster_size, 1).reshape(-1, X.shape[-1])
distance_matrix = cdist(X, centers)
clusters = linear_sum_assignment(distance_matrix)[1]//cluster_size
return clusters
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With