Given an array [x1, x2, x3, ..., xk ] where xi is the number of items in box i, how can I redistribute the items so that no box contains more than N items. N is close to sum(xi)/k -- That is, N is close to every box having the same number of items. Intermediate boxes shouldn't be used to carry items -- if x1 has a surplus and x2 and x3 have deficits, x1 should send some items to x2 and to x3, but not send all its items to x2 and then let x2 resolve the surplus.
The actual problem is: each computing node has a set of samples, and after a resampling step some computer nodes might have a surplus while others have a deficit, so I'd like to re-distribute the samples while minimizing communication.
I imagine this sort of problem has a name.
This problem could be modeled as an instance of minimum-cost flow.
Let G be a directed graph with vertices s, t, a1, …, ak, b1, …, bk and arcs s→ai of capacity xi and cost 0, arcs ai→bj of infinite capacity and cost 0 if i = j and cost 1 if i ≠ j, and arcs bj→t of capacity N and cost 0. We want the minimum-cost flow that moves ∑i xi units from s to t. The idea is that if y units flow ai→bj, then y items are moved from box i to box j (if i ≠ j; if i = j, then no move occurs).
Using minimum-cost flow to solve this simple problem is of course using a sledgehammer to crack a nut, but several variants can be modeled as network flow problems.
If a pair of nodes i, j is not directly connected, remove the ai→bj arc.
We can start up and shut down nodes by giving them vertices on only the "a" side or the "b" side.
We can model differences in communication speeds by adjusting the costs from uniform.
We can limit the number of items two nodes exchange by limiting the capacity of their arc.
We can introduce more interior vertices and change the complete bipartite topology of the connections to model the effects of network contention.
I don't believe your problem (as stated) is complex enough to have attracted independent study. If the machine count (call it C
) is in the thousands, and your sample counts are even trillions large, then it is trivial to send the sample counts to a coordinating master node.
The master can then trivially (O(C)
) compute N
, identify nodes violating this bound, and by how much. Notice that the sum of the excesses over the bound is exactly the minimum amount of communication required. By inserting a slack parameter when calculating N
(i.e. by accepting unbalance), you can control this quantity.
Using a sort, the ordering of nodes by sample count can be had in O(C log C)
. Walk two cursors, one from the either end, towards the middle. At each step, schedule a transfer from the large node to the small node, sized at the minimum of the remaining excess in the large, or remaining slack in the small. Advance the pointer of whichever node had the active constraint in the last sentence and repeat until the new large has no excess. (I believe this is the greedy algorithm @Noxville was getting at.)
Assuming N
is greater than the mean count of samples per node, it is trivial to see this is ideal w.r.t. minimum communication.
If your network of machines has any constraints, either as absent links or maximum flow or variable cost across an edge, then you need to break out the graph flow stuff. However you didn't mention anything like this, and computer clusters within the same data center typically have none.
It sounds like you want to use consistent hashing
http://en.wikipedia.org/wiki/Consistent_hashing
Basically using a good random hash function will allow you to get unique ids for your samples that give an even distribution. It is then easy to distribute the samples across a set of nodes consistently.
See
http://www.lexemetech.com/2007/11/consistent-hashing.html http://www.tomkleinpeter.com/2008/03/17/programmers-toolbox-part-3-consistent-hashing/
for more information.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With