I am trying to cluster ~30 million points (x and y co-ordinates) into clusters. What makes it challenging is that I am trying to minimise the spare capacity of each cluster while also ensuring that the maximum distance between a cluster and any one of its points stays reasonable (no more than 5 km or so).
Each cluster is served by equipment that can handle 64 points: if a cluster contains 64 points or fewer we need one piece of equipment, but if it contains 65 points we need two, which leaves that cluster with a spare capacity of 63. We also need to connect each point to its cluster, so the distance from each point to the cluster is also a factor in the equipment cost.
Ultimately I am trying to minimise the cost of equipment, which seems equivalent to minimising the average spare capacity whilst also ensuring the distance from the cluster to any one of its points is less than 5 km (an approximation, but it will do for the thought experiment; maybe there are better ways to impose this restriction).
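For concreteness, here is a minimal Python sketch of the cost model described above. The 64-point capacity and the 65-point example are from the question; the function names are just illustrative:

```python
import math

CAPACITY = 64  # points served by one piece of equipment

def equipment_needed(cluster_size: int) -> int:
    """Number of 64-point units required for a cluster of this size."""
    return math.ceil(cluster_size / CAPACITY)

def spare_capacity(cluster_size: int) -> int:
    """Unused capacity in a cluster after buying enough units."""
    return equipment_needed(cluster_size) * CAPACITY - cluster_size

# Example from the question: 65 points need 2 units and leave 63 spare.
assert equipment_needed(65) == 2
assert spare_capacity(65) == 63
```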
I have tried multiple approaches:

- O(N log N)
- and then iterate over it until all points have been assigned, O(NK)
- and then repeat that until convergence

I am open to any suggestions on possible algorithms/languages best suited to this. I have experience with machine learning, but I couldn't think of an obvious way of applying it here.
Let me know if I missed any information out.
Since you have both pieces already, my first new suggestion would be to partition the points with k-means for k = n/6400 (you can tweak this parameter) and then use integer programming on each super-cluster. When I get a chance I'll write up my other suggestion, which involves a randomly shifted quadtree dissection.
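A rough sketch of that two-stage pipeline in Python, assuming scikit-learn's MiniBatchKMeans for the coarse partition. The per-super-cluster integer program (which could be built with, e.g., PuLP or OR-Tools) is left as a stub that only returns the equipment lower bound, and names like TARGET_SUPER_CLUSTER_SIZE are illustrative rather than from the answer:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

CAPACITY = 64
TARGET_SUPER_CLUSTER_SIZE = 6400   # gives k = n / 6400; tweak as suggested above

def solve_super_cluster(sub_points: np.ndarray) -> int:
    """Stub for the per-super-cluster step. A real version would run an
    integer program that packs points into groups of <= 64 and enforces
    the ~5 km radius; here we only return the equipment lower bound."""
    return int(np.ceil(len(sub_points) / CAPACITY))

def partition_and_solve(points: np.ndarray) -> int:
    """Coarse k-means partition, then solve each super-cluster independently."""
    k = max(1, len(points) // TARGET_SUPER_CLUSTER_SIZE)
    labels = MiniBatchKMeans(n_clusters=k, batch_size=10_000).fit_predict(points)
    return sum(solve_super_cluster(points[labels == c]) for c in range(k))
```

The point of the coarse partition is that each integer program then only sees a few thousand points, which keeps the exact (or near-exact) step tractable.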
Old pre-question-edit answer below.
You seem more concerned with minimizing equipment and running time than having the tightest possible clusters, so here's a suggestion along those lines.
The idea is to start with 1-node clusters and then use (almost) perfect matchings to pair clusters with each other, doubling the cluster size each round. Do this 6 times to get clusters of 64 (2^6 = 64).
To compute the matching, we use the centroid of each cluster to represent it. Now we just need an approximate matching on a set of points in the Euclidean plane. With apologies to the authors of many fine papers on Euclidean matching, here's an O(n log n) heuristic. If there are two or fewer points, match them in the obvious way. Otherwise, choose a random point P and partition the other points by comparing their coordinate with P's (alternating between the x- and y-coordinate at each level of recursion, as in k-d trees), breaking ties by comparing the other coordinate. Assign P to whichever half has an odd number of points, if there is one (if both halves are even, leave P unmatched). Recursively match the halves.
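Here is a rough Python sketch of that heuristic, where points are (x, y) tuples; the function name approx_match and the fallback for degenerate splits are illustrative additions, not part of the answer:

```python
import random

def approx_match(points, depth=0):
    """Heuristic near-O(n log n) matching on 2-D points using the
    kd-tree-style split described above. Returns (pairs, unmatched)."""
    if len(points) <= 1:
        return [], list(points)
    if len(points) == 2:
        return [(points[0], points[1])], []

    axis = depth % 2                               # alternate between x and y
    i = random.randrange(len(points))
    pivot, rest = points[i], points[:i] + points[i + 1:]

    def key(p):                                    # tie-break on the other coordinate
        return (p[axis], p[1 - axis])

    low = [p for p in rest if key(p) < key(pivot)]
    high = [p for p in rest if key(p) >= key(pivot)]
    if not low or not high:
        # Degenerate split (e.g. many identical coordinates): split
        # arbitrarily so the recursion is guaranteed to shrink.
        mid = len(rest) // 2
        low, high = rest[:mid], rest[mid:]

    # Put the pivot into a half with an odd count if one exists;
    # if both halves are already even, the pivot stays unmatched here.
    unmatched = []
    if len(low) % 2 == 1:
        low.append(pivot)
    elif len(high) % 2 == 1:
        high.append(pivot)
    else:
        unmatched.append(pivot)

    pairs_lo, un_lo = approx_match(low, depth + 1)
    pairs_hi, un_hi = approx_match(high, depth + 1)
    return pairs_lo + pairs_hi, unmatched + un_lo + un_hi
```

To use it for the doubling step: represent each current cluster by its centroid, run approx_match on the centroids, and merge each matched pair of clusters; repeating this six times grows clusters from 1 point to 64.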