I am trying to cluster ~30 million points (x and y co-ordinates) into clusters. What makes it challenging is that I am trying to minimise the spare capacity of each cluster while also ensuring that the maximum distance between a cluster and any one of its points stays reasonable (no more than 5 km or so).
Each cluster is served by equipment that can handle 64 points: if a cluster contains 64 points or fewer we need one piece of equipment, but if it contains 65 points we need two, which leaves that cluster with a spare capacity of 63. We also need to connect each point to its cluster, so the distance from each point to the cluster is also a factor in the equipment cost.
Ultimately I am trying to minimise the cost of equipment, which seems equivalent to minimising the average spare capacity whilst also ensuring the distance from the cluster to any one of its points is less than 5 km (an approximation, but it will do for the thought experiment; maybe there are better ways to impose this restriction).
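For concreteness, here is a minimal Python sketch of the cost model described above. The 64-point capacity and the 65-point example are from the question; the function names are just illustrative:

```python
import math

CAPACITY = 64  # points served by one piece of equipment

def equipment_needed(cluster_size: int) -> int:
    """Number of 64-point units required for a cluster of this size."""
    return math.ceil(cluster_size / CAPACITY)

def spare_capacity(cluster_size: int) -> int:
    """Unused capacity in a cluster after buying enough units."""
    return equipment_needed(cluster_size) * CAPACITY - cluster_size

# Example from the question: 65 points need 2 units and leave 63 spare.
assert equipment_needed(65) == 2
assert spare_capacity(65) == 63
```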
I have tried multiple approaches:

- O(N log N)
- and then iterate over it until all points have been assigned, O(NK)
- and then repeat that until convergence

I am open to any suggestions on possible algorithms/languages best suited to this. I have experience with machine learning, but I couldn't think of an obvious way of applying it here.
Let me know if I missed any information out.
Since you have both pieces already, my first new suggestion would be to partition the points with k-means for k = n/6400 (you can tweak this parameter) and then use integer programming on each super-cluster. When I get a chance I'll write up my other suggestion, which involves a randomly shifted quadtree dissection.
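A rough sketch of that two-stage pipeline in Python, assuming scikit-learn's MiniBatchKMeans for the coarse partition. The per-super-cluster integer program (which could be built with, e.g., PuLP or OR-Tools) is left as a stub that only returns the equipment lower bound, and names like TARGET_SUPER_CLUSTER_SIZE are illustrative rather than from the answer:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

CAPACITY = 64
TARGET_SUPER_CLUSTER_SIZE = 6400   # gives k = n / 6400; tweak as suggested above

def solve_super_cluster(sub_points: np.ndarray) -> int:
    """Stub for the per-super-cluster step. A real version would run an
    integer program that packs points into groups of <= 64 and enforces
    the ~5 km radius; here we only return the equipment lower bound."""
    return int(np.ceil(len(sub_points) / CAPACITY))

def partition_and_solve(points: np.ndarray) -> int:
    """Coarse k-means partition, then solve each super-cluster independently."""
    k = max(1, len(points) // TARGET_SUPER_CLUSTER_SIZE)
    labels = MiniBatchKMeans(n_clusters=k, batch_size=10_000).fit_predict(points)
    return sum(solve_super_cluster(points[labels == c]) for c in range(k))
```

The point of the coarse partition is that each integer program then only sees a few thousand points, which keeps the exact (or near-exact) step tractable.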
Old pre-question-edit answer below.
You seem more concerned with minimizing equipment and running time than having the tightest possible clusters, so here's a suggestion along those lines.
The idea is to start with 1-node clusters and then use (almost) perfect matchings to pair clusters with each other, doubling the cluster size each round. Do this 6 times to get clusters of 64 (2^6 = 64).
To compute the matching, we use the centroid of each cluster to represent it. Now we just need an approximate matching on a set of points in the Euclidean plane. With apologies to the authors of many fine papers on Euclidean matching, here's an O(n log n) heuristic. If there are two or fewer points, match them in the obvious way. Otherwise, choose a random point P and partition the other points by comparing their coordinate with P's (alternating between the x- and y-coordinate at each level of recursion, as in k-d trees), breaking ties by comparing the other coordinate. Assign P to whichever half has an odd number of points, if there is one (if both halves are even, leave P unmatched). Recursively match the halves.
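Here is a rough Python sketch of that heuristic, where points are (x, y) tuples; the function name approx_match and the fallback for degenerate splits are illustrative additions, not part of the answer:

```python
import random

def approx_match(points, depth=0):
    """Heuristic near-O(n log n) matching on 2-D points using the
    kd-tree-style split described above. Returns (pairs, unmatched)."""
    if len(points) <= 1:
        return [], list(points)
    if len(points) == 2:
        return [(points[0], points[1])], []

    axis = depth % 2                               # alternate between x and y
    i = random.randrange(len(points))
    pivot, rest = points[i], points[:i] + points[i + 1:]

    def key(p):                                    # tie-break on the other coordinate
        return (p[axis], p[1 - axis])

    low = [p for p in rest if key(p) < key(pivot)]
    high = [p for p in rest if key(p) >= key(pivot)]
    if not low or not high:
        # Degenerate split (e.g. many identical coordinates): split
        # arbitrarily so the recursion is guaranteed to shrink.
        mid = len(rest) // 2
        low, high = rest[:mid], rest[mid:]

    # Put the pivot into a half with an odd count if one exists;
    # if both halves are already even, the pivot stays unmatched here.
    unmatched = []
    if len(low) % 2 == 1:
        low.append(pivot)
    elif len(high) % 2 == 1:
        high.append(pivot)
    else:
        unmatched.append(pivot)

    pairs_lo, un_lo = approx_match(low, depth + 1)
    pairs_hi, un_hi = approx_match(high, depth + 1)
    return pairs_lo + pairs_hi, unmatched + un_lo + un_hi
```

To use it for the doubling step: represent each current cluster by its centroid, run approx_match on the centroids, and merge each matched pair of clusters; repeating this six times grows clusters from 1 point to 64.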