Possible Duplicate:
K-means algorithm variation with equal cluster size
EDIT: As casperOne pointed out to me, this question is a duplicate. In any case, here is a more general question that covers this one: https://stats.stackexchange.com/questions/8744/clustering-procedure-where-each-cluster-has-an-equal-number-of-points
My requirements
In a project I need to group n points (x, y) into k clusters of equal size (n / k), where x and y are double-precision floating-point numbers, n can range from 100 to 10000, and k can range from 2 to 100. k is known before the algorithm runs.
My experimentations
I started to attack the problem with the http://en.wikipedia.org/wiki/K-means_clustering algorithm, which works great and fast at producing exactly k clusters of roughly the same size.
But my problem is this: k-means produces clusters of roughly the same size, whereas I need the clusters to be exactly the same size (or, more precisely, to have a size between floor(n / k) and ceil(n / k)).
Before you point it out: yes, I tried the first answer here, K-means algorithm variation with equal cluster size, which sounds like a good idea.
The main idea is to post-process the array of clusters produced by k-means, from the biggest cluster down to the smallest. We reduce the size of each cluster that has more than n / k members by moving the extra points to another, nearest cluster, leaving alone the clusters that have already been reduced.
Here is the pseudo code I implemented:
n is the number of points
k is the number of clusters
m = n / k (the ideal cluster size)
c is the array of clusters after k-means

c' = c sorted by size in descending order
for each cluster i in c', i = 1 to k - 1
    e = size of cluster i - m (the number of points to move)
    loop e times
        find the point p in cluster i with minimal distance to a cluster j in c' where j > i
        move point p from cluster i to that cluster j
    end loop
    recalculate centroids
end for each
The problem with this algorithm is that near the end of the process (when i gets close to k), we have to choose a cluster j in c' (with j > i, because we must leave alone the clusters already processed), but the cluster j we find can be far from cluster i, which breaks the very notion of a cluster.
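For concreteness, here is how that post-processing could look (Python for brevity; the question asks for Java, but the idea translates directly). `rebalance` and its argument layout are my own names, and the sketch deliberately mirrors the pseudo code above, including its far-cluster weakness:

```python
import math
from collections import defaultdict

def rebalance(points, assignment, centroids, k):
    """Post-process a k-means result: shrink oversized clusters, largest
    first, by moving surplus points to the nearest not-yet-processed
    cluster. Mutates `assignment` and `centroids` in place."""
    n = len(points)
    target = n // k  # m = ideal cluster size (floor)
    clusters = defaultdict(list)
    for idx, c in enumerate(assignment):
        clusters[c].append(idx)
    # c' = cluster ids sorted by size, descending
    order = sorted(clusters, key=lambda c: len(clusters[c]), reverse=True)
    for pos, ci in enumerate(order[:-1]):      # i = 1 .. k-1
        later = order[pos + 1:]                # only clusters j > i
        while len(clusters[ci]) > target:
            # point in cluster i with minimal distance to a later cluster j
            p, cj = min(
                ((p, cj) for p in clusters[ci] for cj in later),
                key=lambda pc: math.dist(points[pc[0]], centroids[pc[1]]),
            )
            clusters[ci].remove(p)
            clusters[cj].append(p)
            assignment[p] = cj
    # recalculate centroids
    for ci, members in clusters.items():
        xs = [points[p][0] for p in members]
        ys = [points[p][1] for p in members]
        centroids[ci] = (sum(xs) / len(xs), sum(ys) / len(ys))
    return assignment
```

Note that, just like the pseudo code, the destination cluster's size is never capped, so late moves can still land in a distant cluster.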
My question
Is there a post-k-means algorithm or a k-means variant that can meet my requirements, or am I wrong from the beginning and do I need to find another clustering algorithm?
PS: I do not mind implementing the solution myself, but it would be great if I could use a library, ideally in Java.
The k-means clustering algorithm looks straightforward and promising, but does not produce equally sized groups.
K-means assumes spherical cluster shapes (with radius equal to the distance between the centroid and the furthest data point) and doesn't work well when clusters have other shapes, such as elliptical clusters.
As the value of K increases, there are fewer elements in each cluster, so the average distortion decreases: fewer elements per cluster means the elements sit closer to their centroid. The point where this decline in distortion levels off the most is the elbow point.
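That distortion-versus-K reasoning is the basis of the elbow heuristic for choosing K. A minimal sketch with a toy Lloyd's algorithm (function and parameter names are mine, not from any library):

```python
import math
import random

def kmeans(points, k, iters=50, seed=0):
    """Tiny Lloyd's algorithm; returns (centers, average distortion).
    Illustrative only: run it for k = 1, 2, 3, ... and look for the k
    where the drop in average distortion starts to level off."""
    rnd = random.Random(seed)
    centers = rnd.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda c: math.dist(p, centers[c]))].append(p)
        centers = [
            (sum(x for x, _ in g) / len(g), sum(y for _, y in g) / len(g))
            if g else centers[i]           # keep old center if group is empty
            for i, g in enumerate(groups)
        ]
    distortion = sum(min(math.dist(p, c) for c in centers) for p in points) / len(points)
    return centers, distortion
```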
K-means computes k clusters by average approximation. Each cluster is defined by its computed center and is thus unique by definition. Each sample is assigned to the cluster whose center is closest, which is also unique by definition. So in this sense there is NO OVERLAP.
Try this k-means variation:
Initialization:
- Choose k centers from the dataset at random, or even better using the kmeans++ strategy.
- For each point, compute the distance to its nearest cluster center, and build a heap ordered by this distance.
- Draw points from the heap and assign each one to its nearest cluster, unless that cluster is already full. If it is, compute the next nearest cluster center and insert the point into the heap again.

In the end, you should have a partitioning that satisfies your requirement of the same number of objects per cluster, ±1 (make sure the last few clusters also have the right number: the first m clusters should have ceil(n / k) objects, the remainder exactly floor(n / k) objects). Note that using a heap ensures the clusters remain convex: if they were no longer convex, there would have been a better swap candidate.
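The initialization step might be sketched like this (Python for brevity; `constrained_init` is a hypothetical name, and the sketch assumes the centers have already been chosen, e.g. by kmeans++):

```python
import heapq
import math

def constrained_init(points, centers):
    """Capacity-constrained nearest-center assignment via a heap.
    The first (n % k) clusters get ceil(n/k) points, the rest floor(n/k)."""
    n, k = len(points), len(centers)
    cap = [n // k + (1 if i < n % k else 0) for i in range(k)]
    heap = []
    for idx, p in enumerate(points):
        d, c = min((math.dist(p, centers[j]), j) for j in range(k))
        heapq.heappush(heap, (d, idx, c))       # idx breaks distance ties
    assignment = [None] * n
    size = [0] * k
    while heap:
        d, idx, c = heapq.heappop(heap)
        if size[c] < cap[c]:
            assignment[idx] = c
            size[c] += 1
        else:
            # cluster full: re-queue with the nearest non-full center
            d2, c2 = min((math.dist(points[idx], centers[j]), j)
                         for j in range(k) if size[j] < cap[j])
            heapq.heappush(heap, (d2, idx, c2))
    return assignment
```

Each pop either assigns a point or re-queues it toward a currently non-full center, so the loop always terminates with every capacity met exactly.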
Iteration step:
Requisites: a list for each cluster with "swap proposals" (objects that would prefer to be in a different cluster).
E step: compute the updated cluster centers as in regular k-means
M step: Iterate through all points (either one at a time, or all in one batch). For each object, compute the nearest cluster center / all cluster centers that are closer than the current cluster. If the nearest is a different cluster:
- If the other cluster is smaller than the current cluster, just move the object to it.
- If there is a swap proposal from the other cluster (or from any cluster with a lower distance), swap the two objects' cluster assignments (if there is more than one offer, choose the one with the largest improvement).
- Otherwise, record a swap proposal for the other cluster.
The cluster sizes remain invariant (± the ceil/floor difference), and objects are only moved from one cluster to another as long as it improves the estimation. It should therefore converge at some point, like k-means, though it might be a bit slower (i.e. take more iterations).
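A stripped-down version of the size-preserving swap idea: instead of maintaining per-cluster proposal lists, this sketch (names are mine) simply swaps any pair of points whose exchange lowers the total distance to the fixed centers. It is O(n^2) per pass, so it is illustrative rather than efficient:

```python
import math

def swap_step(points, assignment, centers):
    """Repeatedly swap the cluster labels of two points whenever the swap
    lowers the total point-to-center distance. Cluster sizes never change,
    and every swap is a strict improvement, so the loop terminates."""
    swaps = 0
    improved = True
    while improved:
        improved = False
        for a in range(len(points)):
            for b in range(a + 1, len(points)):
                ca, cb = assignment[a], assignment[b]
                if ca == cb:
                    continue
                now = math.dist(points[a], centers[ca]) + math.dist(points[b], centers[cb])
                after = math.dist(points[a], centers[cb]) + math.dist(points[b], centers[ca])
                if after < now - 1e-12:   # strict improvement avoids cycling
                    assignment[a], assignment[b] = cb, ca
                    improved = True
                    swaps += 1
    return swaps
```

In a real implementation the centers would be recomputed between passes (the E step above); here they are held fixed to keep the swap logic isolated.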
I do not know if this has been published or implemented before. It's just what I would try (if I were to try k-means; there are much better clustering algorithms).
Not being an expert on the topic, I once needed to come up with a simple algorithm to cluster locations on a map, where every point needed to be part of a cluster, and clusters were constrained in several ways (not just in size, i.e. point count, but also in other measures that depended on varying factors).
I got the best results by first finding the "difficult" points and then growing clusters from there. "Difficult" points are points that are hard to reach, e.g. because they lie alone on the outskirts of the total area, or because they push against another cluster boundary condition more than other points do. Handling them first helped grow neatly aligned clusters, leaving very few loners and little corresponding handwork to place them.
This may help you if your current algorithm normally finds these difficult points last.
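One way to sketch that strategy (hypothetical names; "difficult" is simplified here to "farthest from the centroid of the points not yet clustered", which captures the outskirts case but not the other boundary conditions):

```python
import math

def grow_from_difficult(points, k):
    """Equal-size clustering sketch of the 'difficult points first' idea:
    repeatedly seed a cluster with the remaining point farthest from the
    centroid of the remaining points (a loner on the outskirts), then fill
    the cluster with that seed's nearest remaining neighbors."""
    n = len(points)
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    remaining = set(range(n))
    labels = [None] * n
    for ci, size in enumerate(sizes):
        cx = sum(points[i][0] for i in remaining) / len(remaining)
        cy = sum(points[i][1] for i in remaining) / len(remaining)
        # the 'difficult' seed: farthest remaining point from the rest
        seed = max(remaining, key=lambda i: math.dist(points[i], (cx, cy)))
        members = sorted(remaining,
                         key=lambda i: math.dist(points[i], points[seed]))[:size]
        for i in members:
            labels[i] = ci
            remaining.discard(i)
    return labels
```

Because each cluster takes exactly its quota of points, the floor/ceil size constraint holds by construction; the trade-off is that later clusters get whatever points are left over.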