 

Load-balancing in parallel processing application

I'm building a network-distributed parallel processing application that uses a combination of CPU and GPU resources across many machines.

The app has to perform some very computationally expensive operations on a very large dataset over thousands of iterations:

for step = 0 to requested_iterations
  for i = 0 to width
    for j = 0 to height
      for k = 0 to depth
        matrix[i,j,k] = G*f(matrix[i,j,k])

Also, the matrix operations have to be executed synchronously: that is, each iteration depends on the results of the frame that came immediately before it.

The hardware available in this ad-hoc grid, comprising both dedicated servers and idle desktop machines, varies greatly in performance from machine to machine. I'm wondering what the best way is to balance the work load across the entire system.

Some idiosyncrasies:

  1. The grid should be as robust as possible. Some simulations require weeks to run, and it would be nice not to have to cancel a run if one out of 100 machines goes offline.

  2. Some of the lower-end machines (desktops that are idle but have to wake up when someone logs in) may join and leave the grid at any time.

  3. The dedicated servers may also join and leave the grid, but this is predictable.

So far, the best idea I've been able to come up with is:

  1. Have each node track how long it takes to process a group of n cells in the matrix (cells processed per unit time) and report this to a central repository.
  2. Weight this time against the total time for a frame (across the entire grid) of the simulation and the total size of the problem domain. So, each node would get a score expressed in work units (matrix cells) per time, and a scalar rating expressing its performance vs the rest of the grid.
  3. On each frame, distribute the work load based on those scores so that each machine finishes as close to the same time as possible. If machine A is 100x faster than machine B, it will receive 100x as many matrix cells to process in a given frame (assuming that the matrix size is large enough to warrant including the extra machines).
  4. Nodes that leave the grid (desktops that are logged into, etc.) will have their workload redistributed among the remaining nodes.
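A minimal sketch of steps 2–3 above, assuming each node's recent throughput (cells per second) has already been measured and collected; the function name and data shapes are illustrative, not part of the original design:

```python
def split_frame(total_cells, throughputs):
    """Split a frame's cells among nodes in proportion to measured throughput.

    throughputs: dict of node name -> cells/sec averaged over recent frames.
    Returns a dict of node name -> number of cells to assign this frame.
    """
    total_rate = sum(throughputs.values())
    shares = {}
    assigned = 0
    for node, rate in sorted(throughputs.items()):
        share = int(total_cells * rate / total_rate)
        shares[node] = share
        assigned += share
    # Rounding can leave a few cells unassigned; give them to the fastest node.
    fastest = max(throughputs, key=throughputs.get)
    shares[fastest] += total_cells - assigned
    return shares

# Machine A is 100x faster than machine B, so it gets ~100x the cells.
shares = split_frame(1000, {"A": 100.0, "B": 1.0})
```

In practice the throughput figures would be refreshed each frame, so a node that slows down (e.g. a desktop whose owner logs in) automatically receives less work on the next frame.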

Or,

Arrange the nodes in a tree structure, where each node is assigned a "weight". Nodes higher in the tree have a weight based on their own ability combined with that of their children. This weight is adjusted per frame. When a node loses communication with a child, it uses a cached tree graph to contact the orphaned children and re-balance its branch.

If it makes a difference, the app is a combination of C# and OpenCL.

Links to papers, example apps, and especially tutorials are welcome.

Edit

This isn't homework. I'm turning a simulator I wrote as part of my thesis into a more useful product. Right now the work is distributed uniformly with no accounting for performance of each machine, and no facility to recover from machines joining or leaving the grid.

Thanks for the excellent, detailed responses.

asked Aug 26 '11 by 3Dave



2 Answers

For heterogeneous clusters, I like to let each processor request a new job as it becomes available. Implementation involves a lightweight server that can handle many requests at a time (but usually only returns a job number). It might go something like this:

  • Break the job down into its smallest components (we know there are 1000 tasks now)
  • Start a network server (preferably UDP with timeouts to avoid network congestion) which counts upwards
  • Start your cluster processes.
  • Each process asks, "What job number should I perform?" and the server replies with a number
  • As the process finishes, it asks for the next job number. When all tasks are complete, the server returns a -1 to the processes, so they shut down.

This is a lighter weight alternative to what you suggest above. Your fast processors still do more work than your slower machines, but you don't have to calculate how long the tasks take. If a processor drops out for whatever reason, it will stop asking for tasks. Your server could choose to recycle task numbers after a certain amount of time.

This is pretty much what a cluster scheduler would do on its own, except the processors don't have startup and shutdown costs, so your individual tasks can be smaller without penalty.
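A minimal in-process sketch of this pattern: threads stand in for the networked cluster processes, and a locked counter stands in for the UDP server. All names are illustrative, not a prescribed API:

```python
import threading

class TaskServer:
    """Hands out job numbers by counting upward; -1 means all tasks are gone."""
    def __init__(self, n_tasks):
        self.n_tasks = n_tasks
        self.next_task = 0
        self.lock = threading.Lock()

    def request_task(self):
        with self.lock:
            if self.next_task >= self.n_tasks:
                return -1          # signals the worker to shut down
            task = self.next_task
            self.next_task += 1
            return task

def worker(server, done):
    # Faster workers simply come back sooner, so they naturally grab more
    # tasks; no per-machine timing is ever computed.
    while True:
        task = server.request_task()
        if task == -1:
            return
        done.append(task)          # stand-in for "process task"

server = TaskServer(1000)
done = []
threads = [threading.Thread(target=worker, args=(server, done)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The lock guarantees each job number is handed out exactly once, which is the same property the single-threaded UDP server gives you over the network.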

answered Sep 22 '22 by michael

I would go for a decentralized solution.

Every node picks (rather than being given) the same amount of work from the center. After some runs, every node is able to estimate its own average computational power and communicate it to the others.

In the end, every node will have a table of every node's average computing power. Having this information (it could even be persistent, why not?), each node can decide to "ask" some other node with more power to take over some work by signing a contract.

Before starting any piece of work, a node has to broadcast: "I am starting X." Once finished, it always broadcasts: "I finished X."

Well, it's not so easy, because there will be cases where a node begins a job, its hard disk fails, and it never finishes. The others, especially the ones waiting on its result, have to figure this out, pick the job back out of the basket, and start it from the beginning. This is where the "ping" technique with a timer comes in.
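A sketch of that ping-with-a-timer idea, assuming some coordinator (or any peer holding the table) tracks heartbeats and reclaims jobs whose owner has gone quiet; the timeout value and all names are illustrative:

```python
TIMEOUT = 30.0     # seconds without a heartbeat before a job is reassigned

in_progress = {}   # job_id -> (node, time of last heartbeat)
pending = []       # jobs back in the basket, ready to be picked up again

def heartbeat(job_id, node, now):
    """Record that `node` is still working on `job_id` at time `now`."""
    in_progress[job_id] = (node, now)

def reclaim_stalled(now):
    """Move any job whose owner has missed the timeout back into the basket."""
    for job_id, (node, last_seen) in list(in_progress.items()):
        if now - last_seen > TIMEOUT:
            del in_progress[job_id]   # the node is presumed dead
            pending.append(job_id)    # job restarts from the beginning

# A desktop claims a job, then goes silent well past the timeout.
heartbeat("job-0", "desktop-17", now=0.0)
reclaim_stalled(now=60.0)
```

In a real grid `now` would come from a monotonic clock on the machine running the check, and reclaimed jobs would be re-announced with the same "I am starting X" broadcast as before.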

Bad: the initial tuning period can take a non-trivial amount of time.

Good: you will have an almost fault-tolerant solution. Leave it for a week, and even if some nodes fail, your grid is still alive and doing its work.

Many years ago I did something like this, with pretty good results. But it definitely wasn't on as large a scale as you describe. And scale, actually, makes a difference.

So the choice is up to you.

Hope this helps.

answered Sep 23 '22 by Tigran