
Grouping arbitrary arrays of data into N bins

I want to group an arbitrary-sized array of random values into n groups, such that the sum of values in any one group/bin is as equal as possible.

So for values [1, 2, 4, 5] and n = 2, the output buckets should be [sum(5+1), sum(4+2)].

Some possibilities that occur to me:

  • Full exhaustive breadth first search
  • Random processes with stopping conditions hard coded
  • Start from one end of the sorted array, grouping until the sum is equal to the global average, and move to the next group until n is reached

It seems like the optimal solution (where the sums of the bins are as equal as possible given the input array) is probably non-trivial, so at the moment I'm leaning towards the last option. Still, I have the feeling I may be missing more elegant solutions?
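To make the last option concrete, here is a minimal Python sketch of one way to read it: sort the values, keep adding to the current group until its sum reaches the global average, then move on. The function name `sequential_fill` and the rule that the final group takes whatever is left are assumptions for illustration.

```python
def sequential_fill(values, n):
    """Fill groups in sorted order, advancing once a group reaches the average."""
    target = sum(values) / n              # global average sum per group
    bins = [[] for _ in range(n)]
    current = 0
    for value in sorted(values):
        # Advance to the next group once this one has reached the target,
        # unless it is already the last group (assumed to absorb the rest).
        if current < n - 1 and sum(bins[current]) >= target:
            current += 1
        bins[current].append(value)
    return bins


print(sequential_fill([1, 2, 4, 5], 2))   # [[1, 2, 4], [5]]
```

On the example above this yields sums of 7 and 5 rather than 6 and 6, so under this reading the approach is quick but not optimal.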

malangi asked Mar 02 '12 23:03


1 Answer

This is an NP-hard problem (it is a form of multiway number partitioning). In other words, no polynomial-time algorithm is known that always finds an optimal solution, and a naive exhaustive search has to consider n^M assignments (where M is the size of your array and n the number of bins). It's a problem very similar to clustering, which is also NP-hard.

If your data set is small enough to deal with, a brute force algorithm is best (explore all combinations).
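As a rough illustration of the brute-force route, the sketch below tries all n^M assignments with `itertools.product`. The function name `brute_force_bins` is made up for this example, and "as equal as possible" is assumed to mean the smallest gap between the largest and smallest bin sums.

```python
from itertools import product


def brute_force_bins(values, n):
    """Try every assignment of values to n bins and keep the most balanced one."""
    best_spread, best_bins = None, None
    # Each assignment is a tuple giving the bin index chosen for each value.
    for assignment in product(range(n), repeat=len(values)):
        bins = [[] for _ in range(n)]
        for value, bin_index in zip(values, assignment):
            bins[bin_index].append(value)
        sums = [sum(b) for b in bins]
        spread = max(sums) - min(sums)    # assumed measure of imbalance
        if best_spread is None or spread < best_spread:
            best_spread, best_bins = spread, bins
    return best_bins


print(brute_force_bins([1, 2, 4, 5], 2))  # [[1, 5], [2, 4]]
```

This is only practical for small inputs, since the loop runs n^M times.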

However, if your data set is big, you'll want a polynomial-time algorithm that won't get you the optimal solution, but a good approximation. In that case, I suggest you use something similar to K-Means...

Step 1. Calculate the expected sum per bin. Let A be your array, then the expected sum per bin is SumBin = SUM(A) / n (the sum of all elements in your array over the number of bins).

Step 2. Put all elements of your array in some collection (e.g. another array) that we'll call The Bag (this is just a conceptual device, so you understand the next steps).

Step 3. Partition The Bag into n groups (preferably randomly, so that each element ends up in some bin i with probability 1/n). At this point, your bins have all the elements, and The Bag is empty.

Step 4. Calculate the sum for each bin. If the result is the same as in the last iteration, exit. (This is the expectation step of K-Means.)

Step 5. For each bin i, if its sum is greater than SumBin, pick the first element greater than SumBin and put it back in The Bag; if its sum is less than SumBin, pick the first element less than SumBin and put it back in The Bag. This is the gradient descent step (aka maximization step) of K-Means.

Step 6. Go to step 3.
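The steps above can be sketched fairly literally in Python. This is only one reading of them: the function name `rebalance_bins`, the `max_iters` safety cap, and the `seed` parameter are assumptions added for illustration, not part of the steps themselves.

```python
import random


def rebalance_bins(values, n, max_iters=1000, seed=None):
    """Literal sketch of steps 1-6; max_iters and seed are assumed extras."""
    rng = random.Random(seed)
    target = sum(values) / n              # Step 1: SumBin
    bag = list(values)                    # Step 2: The Bag
    bins = [[] for _ in range(n)]
    previous_sums = None
    for _ in range(max_iters):
        # Step 3: move everything in The Bag into randomly chosen bins.
        while bag:
            bins[rng.randrange(n)].append(bag.pop())
        # Step 4: stop when the sums did not change since the last pass.
        sums = [sum(b) for b in bins]
        if sums == previous_sums:
            break
        previous_sums = sums
        # Step 5: return one element per unbalanced bin to The Bag.
        for b, s in zip(bins, sums):
            if s > target:
                picked = next((v for v in b if v > target), None)
            elif s < target:
                picked = next((v for v in b if v < target), None)
            else:
                picked = None
            if picked is not None:
                b.remove(picked)
                bag.append(picked)
        # Step 6: loop back to step 3.
    # If the cap was hit, put any leftover elements back into bins.
    while bag:
        bins[rng.randrange(n)].append(bag.pop())
    return bins


bins = rebalance_bins([1, 2, 4, 5, 7, 9], 3, seed=0)
print([sum(b) for b in bins])
```

The result depends on the random draws, so it can be worth running it several times and keeping the most balanced outcome.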

This algorithm is just an approximation, but it's fast. Because step 3 reassigns elements at random, it is safer to cap the number of iterations than to rely on it converging.

If you are skeptical about a randomized algorithm like the one above, then after the first iteration, when you are back at step 3, you can assign elements optimally by running the Hungarian algorithm instead of assigning them randomly. I am not sure that will guarantee better overall results, though.

Diego answered Oct 05 '22 01:10