
Clustering while trying to minimise spare capacity

I am trying to cluster ~30 million points (x and y coordinates) into clusters. The addition that makes it challenging is that I am trying to minimise the spare capacity of each cluster while also ensuring that the maximum distance between a cluster and any one of its points is not huge (more than ~5 km or so).

Each cluster is served by equipment that can handle 64 points: a cluster with 64 or fewer points needs one piece of equipment, but a cluster with 65 points needs two, which leaves that cluster with a spare capacity of 63. We also need to connect each point to its cluster, so the distance from each point to the cluster is a factor in the equipment cost as well.

Ultimately I am trying to minimise the cost of equipment, which seems equivalent to minimising the average spare capacity whilst also ensuring that the distance from a cluster to any of its points is less than 5 km (an approximation, but it will do for the thought experiment; maybe there are better ways to impose this restriction).
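
For concreteness, the cost model boils down to something like this (the constants and function names here are just for illustration):

    import math

    CAPACITY = 64    # points one piece of equipment can serve

    def equipment_needed(cluster_size):
        """Pieces of equipment a cluster of this size requires."""
        return math.ceil(cluster_size / CAPACITY)

    def spare_capacity(cluster_size):
        """Unused capacity once the required equipment is installed."""
        return equipment_needed(cluster_size) * CAPACITY - cluster_size

    assert equipment_needed(64) == 1 and spare_capacity(64) == 0
    assert equipment_needed(65) == 2 and spare_capacity(65) == 63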

I have tried multiple approaches:

  • K-means
    • Most should know how this works
    • Average spare capacity of 32
    • Runs in O(n^2)
  • Sorted list of a-b distances
    • I tried an alternative approach like so:
      1. Initialise cluster points by randomly selecting points from the data
      2. Determine the distance matrix between every point and every cluster
      3. Flatten it into a list
      4. Sort the list
      5. Go from smallest to largest distance, assigning points to clusters
      6. Assign points to a cluster until it reaches 64 points, after which no more can be assigned to it
      7. Stop iterating through the list once all points have been assigned
      8. Update the cluster centroid based on the assigned points
      9. Repeat steps 2 - 8 until the cluster locations converge (as in K-means)
      10. Collect cluster locations that are nearby into one cluster
    • This had an average spare capacity of approximately 0, by design
    • This worked well for my test data set, but as soon as I expanded to the full set (30 million points) it took far too long, probably because we have to sort the full list of N x K distances, O(NK log NK), iterate over it until all points have been assigned, O(NK), and then repeat all of that until convergence (a rough sketch of one pass is shown just after this list of approaches)
  • Linear Programming
    • This was quite simple to implement using libraries, but it also took far too long because of the size of the problem
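
For reference, one pass of the sorted-distances approach (steps 2 to 8 above) looks roughly like this in NumPy (unoptimised, names are just illustrative):

    import numpy as np

    def capacity_limited_assign(points, centroids, capacity=64):
        """Steps 2-7: scan all point-centroid distances in ascending order and
        greedily assign each point to the nearest centroid that still has room.
        points is an (N, 2) array, centroids is a (K, 2) array."""
        n, k = len(points), len(centroids)
        d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        order = np.argsort(d, axis=None)       # flatten and sort all N*K distances
        assignment = np.full(n, -1)
        remaining = np.full(k, capacity)
        unassigned = n
        for flat in order:
            p, c = divmod(flat, k)             # row-major: row = point, column = centroid
            if assignment[p] == -1 and remaining[c] > 0:
                assignment[p] = c
                remaining[c] -= 1
                unassigned -= 1
                if unassigned == 0:            # step 7: stop once everything is placed
                    break
        return assignment

    def update_centroids(points, assignment, centroids):
        """Step 8: move each centroid to the mean of its assigned points."""
        new = centroids.copy()
        for c in range(len(centroids)):
            members = points[assignment == c]
            if len(members):
                new[c] = members.mean(axis=0)
        return new

Even written this way, the N x K distance matrix is the bottleneck: with 30 million points and the hundreds of thousands of clusters needed, it does not come close to fitting in memory, which matches the slowdown described above.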

I am open to any suggestions on possible algorithms/languages best suited to do this. I have experience with machine learning, but couldn't think of an obvious way to apply it here.

Let me know if I missed any information out.

asked Feb 08 '19 by Adam Dadvar



1 Answer

Since you have both pieces already, my first new suggestion would be to partition the points with k-means for k = n/6400 (you can tweak this parameter) and then use integer programming on each super-cluster. When I get a chance I'll write up my other suggestion, which involves a randomly shifted quadtree dissection.
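
A rough sketch of that two-stage idea, assuming scikit-learn for the partitioning and PuLP with the bundled CBC solver for the per-super-cluster integer programs. Candidate equipment sites taken from a finer k-means inside each super-cluster, and coordinates measured in km, are assumptions of mine, not part of your setup:

    import numpy as np
    import pulp
    from sklearn.cluster import MiniBatchKMeans

    CAPACITY, MAX_DIST_KM = 64, 5.0

    def solve_super_cluster(pts, candidates):
        """Open the fewest equipment sites (chosen from `candidates`) so that every
        point is assigned to an open site within MAX_DIST_KM and no site serves more
        than CAPACITY points. Infeasibility means the candidate pool is too thin."""
        d = np.linalg.norm(pts[:, None, :] - candidates[None, :, :], axis=2)
        prob = pulp.LpProblem("equipment", pulp.LpMinimize)
        open_site = [pulp.LpVariable(f"y{j}", cat="Binary") for j in range(len(candidates))]
        assign = {(i, j): pulp.LpVariable(f"x{i}_{j}", cat="Binary")
                  for i in range(len(pts)) for j in range(len(candidates))
                  if d[i, j] <= MAX_DIST_KM}
        by_point = {i: [] for i in range(len(pts))}
        by_site = {j: [] for j in range(len(candidates))}
        for (i, j), v in assign.items():
            by_point[i].append(v)
            by_site[j].append(v)
        prob += pulp.lpSum(open_site)                     # minimise equipment count
        for i in range(len(pts)):                         # every point served exactly once
            prob += pulp.lpSum(by_point[i]) == 1
        for j in range(len(candidates)):                  # capacity + linking constraint
            prob += pulp.lpSum(by_site[j]) <= CAPACITY * open_site[j]
        prob.solve(pulp.PULP_CBC_CMD(msg=False))
        return [(i, j) for (i, j), v in assign.items() if (v.value() or 0) > 0.5]

    points = np.random.rand(50_000, 2) * 50               # stand-in for the real data
    k = max(1, len(points) // 6400)                       # super-clusters of ~6400 points
    labels = MiniBatchKMeans(n_clusters=k, batch_size=10_000).fit_predict(points)
    for c in range(k):
        members = points[labels == c]
        if not len(members):
            continue
        n_cand = min(len(members), 2 * -(-len(members) // CAPACITY))   # generous site pool
        cand = MiniBatchKMeans(n_clusters=n_cand).fit(members).cluster_centers_
        chosen = solve_super_cluster(members, cand)

Each super-cluster MIP is still fairly large, so in practice you may want smaller super-clusters, a thinner candidate pool, or a solver time limit; the super-clusters are independent, though, so they can be solved in parallel.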

Old pre-question-edit answer below.


You seem more concerned with minimizing equipment and running time than having the tightest possible clusters, so here's a suggestion along those lines.

The idea is to start with 1-node clusters and then use (almost) perfect matchings to pair clusters with each other, doubling the size. Do this 6 times to get clusters of 64.

To compute the matching, we use the centroid of each cluster to represent it. Now we just need an approximate matching on a set of points in the Euclidean plane. With apologies to the authors of many fine papers on Euclidean matching, here's an O(n log n) heuristic. If there are two or fewer points, match them in the obvious way. Otherwise, choose a random point P and partition the other points by comparing their (alternate between x- and y-) coordinate with P (as in kd-trees), breaking ties by comparing the other coordinate. Assign P to a half with an odd number of points if possible. (If both are even, let P be unmatched.) Recursively match the halves.
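
A literal (unoptimised) Python sketch of that heuristic, assuming the input points (the centroids) are distinct; anything left unmatched simply does not appear in a returned pair:

    import random

    def approx_match(points, depth=0):
        """Approximate Euclidean matching: returns a list of matched pairs.
        `points` is a list of (x, y) tuples; at most one point per recursive
        call may be left unmatched, as described above."""
        if len(points) <= 1:
            return []
        if len(points) == 2:
            return [(points[0], points[1])]
        axis = depth % 2                           # alternate x / y, kd-tree style
        i = random.randrange(len(points))
        p, rest = points[i], points[:i] + points[i + 1:]
        key = lambda q: (q[axis], q[1 - axis])     # break ties on the other coordinate
        left = [q for q in rest if key(q) < key(p)]
        right = [q for q in rest if key(q) >= key(p)]
        if len(left) % 2 == 1:                     # give P to a half with an odd count
            left.append(p)
        elif len(right) % 2 == 1:
            right.append(p)
        # otherwise both halves are even and P stays unmatched at this level
        return approx_match(left, depth + 1) + approx_match(right, depth + 1)

To build the 64-point clusters, start with every point as its own cluster, run approx_match on the current centroids, merge each matched pair and recompute its centroid, and repeat six times (2^6 = 64); clusters that go unmatched at some level simply stay smaller.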

answered Sep 22 '22 by David Eisenstat