An assignment that I've just now completed requires me to create a set of scripts that can configure random Ubuntu machines as nodes in an MPI computing cluster. This has all been done and the nodes can communicate with one another properly. However, I would now like to demonstrate the efficiency of said MPI cluster by throwing a parallel program at it. I'm just looking for a straight brute force calculation that can divide up work among the number of processes (=nodes) available: if one node takes 10 seconds to run the program, 4 nodes should only take about 2.5.
With that in mind I looked for prime calculation programs written in C. For any purists: the program is not actually part of my assignment, as the course I'm taking is purely systems management. I just need anything that will show that my cluster is working. I have some programming experience, but little in C and none with MPI. I've found quite a few sample programs, but none of them seem to actually run in parallel. They do distribute all the steps among my nodes, so if one node has a faster processor the overall time goes down, but adding additional nodes does nothing to speed up the calculation.
Am I doing something wrong? Are the programs that I've found simply not parallel? Do I need to learn C programming for MPI to write my own program? Are there any other parallel MPI programs that I can use to demonstrate my cluster at work?
EDIT
Thanks to the answers below I've managed to get several MPI programs working, among them the sum of the first N natural numbers (which isn't very useful as it quickly runs into data type limits), the counting and generating of prime numbers, and the Monte Carlo calculation of Pi. Interestingly, only the prime number programs realise a (sometimes dramatic) performance gain with multiple nodes/processes.
The issue that caused most of my initial problems with getting the programs working was rather obscure and apparently due to issues with the hosts files on the nodes. Running mpiexec with the -disable-hostname-propagation parameter solved this problem, which may manifest itself in a variety of ways: MPI_Barrier errors, TCP connect errors and other generic connection failures. I believe it may be necessary for all nodes in the cluster to know one another by hostname, which is not really an issue in classic Beowulf clusters that have DHCP/DNS running on the server node.
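For reference, the invocation looked something like the line below (my own reconstruction, not copied from my notes: "hosts" and "my_mpi_program" are placeholders, -f is the hostfile option of MPICH's Hydra launcher, and -disable-hostname-propagation is the Hydra option mentioned above):

    mpiexec -disable-hostname-propagation -f hosts -n 4 ./my_mpi_program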
The usual proof of concept application in parallel programming is simple raytracing.
That being said, I don't think that raytracing is a good example to show off the power of OpenMPI. I'd put the emphasis on scatter/gather, or even better scatter/reduce, because that's where MPI shows its true power :)
The most basic example for that would be calculating the sum of the first N integers. You'll need a master process that packs the value ranges to sum over into an array and scatters these ranges across the workers.
Then you'll need to do a reduction and check your result against the explicit formula N(N+1)/2, to get a free validation test.
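A minimal sketch of that scatter/reduce sum (my own code, not taken from any of the sample programs mentioned; for simplicity it assumes N is divisible by the number of processes):

    /* Sum of the first N integers with MPI_Scatter / MPI_Reduce.
     * Build: mpicc -o sum sum.c    Run: mpiexec -n 4 ./sum */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        const long long N = 100000000;      /* how many integers to sum */
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        long long chunk = N / size;         /* assumes N % size == 0 */
        long long *starts = NULL;
        if (rank == 0) {
            /* master packs the start of each worker's range into an array */
            starts = malloc(size * sizeof(long long));
            for (int i = 0; i < size; i++)
                starts[i] = (long long)i * chunk + 1;
        }

        long long my_start;
        MPI_Scatter(starts, 1, MPI_LONG_LONG, &my_start, 1, MPI_LONG_LONG,
                    0, MPI_COMM_WORLD);

        long long local_sum = 0;
        for (long long i = my_start; i < my_start + chunk; i++)
            local_sum += i;

        long long total = 0;
        MPI_Reduce(&local_sum, &total, 1, MPI_LONG_LONG, MPI_SUM,
                   0, MPI_COMM_WORLD);

        if (rank == 0) {
            /* free validation against the closed form */
            printf("parallel sum = %lld, closed form = %lld\n",
                   total, N * (N + 1) / 2);
            free(starts);
        }
        MPI_Finalize();
        return 0;
    }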
If you're looking for a weaker spot of MPI, a parallel grep might work, where IO is the bottleneck.
EDIT
You'll have to keep in mind that MPI is based on a shared-nothing architecture where the nodes communicate using messages, and that the number of nodes is fixed. These two factors set a very tight frame for the programs that run on it. To make a long story short, this kind of parallelism is great for data-parallel applications, but sucks for task-parallel applications, because you can usually distribute data better than tasks if the number of nodes changes.
Also, MPI has no concept of implicit work-stealing. If a node is finished working, it just sits around waiting for the other nodes to finish. That means you'll have to figure out weakest-link handling yourself.
MPI is very customizable when it comes to performance details; there are numerous variants of MPI_Send, for example. That leaves much room for performance tweaking, which is important for the high performance computing MPI was designed for, but mostly confuses "ordinary" programmers and leads to programs that actually get slower when run in parallel. Maybe your examples just suck :)
And on the scaleup / speedup problem, well...
I suggest that you read up on Amdahl's Law, and you'll see that it's impossible to get linear speedup just by adding more nodes :)
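For reference, the standard statement of the law (my addition, not part of the original answer): if a fraction p of the program can run in parallel, the best possible speedup on n nodes is

S(n) = 1 / ((1 - p) + p / n)

so the serial fraction (1 - p) puts a hard ceiling on how much extra nodes can help.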
I hope that helped. If you still have questions, feel free to drop a comment :)
EDIT2
Maybe the best scaling problem that integrates perfectly with MPI is the empirical estimation of Pi.
Imagine a quarter circle with radius 1 inside a square with sides of length 1. You can then estimate Pi by firing random points into the square and counting how many of them land inside the quarter circle.
Note: this is equivalent to generating tuples (x, y) with x, y in [0, 1] and counting how many of them satisfy x² + y² <= 1.
Pi is then roughly equal to
4 * Points in Circle / total Points
In MPI you'd just have to gather the counts (or ratios) generated by all processes, which is very little overhead and thus makes this a perfect proof-of-concept problem for your cluster.
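A minimal sketch of that Monte Carlo estimate (my own code, not the author's; it reduces raw hit counts rather than ratios, and the per-rank rand() seeding is fine for a demo but not for serious statistics):

    /* Monte Carlo estimation of Pi: each rank fires `samples` random points
     * into the unit square and MPI_Reduce sums the hit counts on rank 0. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        const long long samples = 10000000;   /* points per process */
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        srand(rank + 1);                      /* different stream per process */

        long long hits = 0;
        for (long long i = 0; i < samples; i++) {
            double x = (double)rand() / RAND_MAX;
            double y = (double)rand() / RAND_MAX;
            if (x * x + y * y <= 1.0)         /* inside the quarter circle? */
                hits++;
        }

        long long total_hits = 0;
        MPI_Reduce(&hits, &total_hits, 1, MPI_LONG_LONG, MPI_SUM,
                   0, MPI_COMM_WORLD);

        if (rank == 0) {
            double pi = 4.0 * (double)total_hits / ((double)samples * size);
            printf("Pi is approximately %f\n", pi);
        }
        MPI_Finalize();
        return 0;
    }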
As with any other computing paradigm, there are certain well-established patterns in use with distributed-memory programming. One such pattern is the "bag of jobs" or "controller/worker" (previously known as "master/slave", but that name is now considered politically incorrect). It is well suited to your case.
The basic premise is very simple. The "controller" process has a big table/queue of jobs and essentially executes one big loop (possibly an infinite one). It listens for messages from the "worker" processes and responds to them. In the simplest case workers send only two types of messages: job requests or computed results. Correspondingly, the controller process sends two types of messages: job descriptions or termination requests.
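A minimal skeleton of that loop might look like this (my own sketch, not the code referenced below; the tags and do_job() are made up for illustration, and the results sent with each request are simply ignored here):

    /* Controller/worker ("bag of jobs") skeleton: workers ask for work,
     * the controller hands out job indices until none are left. */
    #include <mpi.h>
    #include <stdio.h>

    #define TAG_REQUEST 1   /* worker -> controller: "give me work" (payload: last result) */
    #define TAG_JOB     2   /* controller -> worker: job description (here: a job index) */
    #define TAG_STOP    3   /* controller -> worker: no more work, shut down */

    /* hypothetical placeholder for the real computation (e.g. one image row) */
    static double do_job(int job) { return (double)job * job; }

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int num_jobs = 100;

        if (rank == 0) {                       /* controller */
            int next_job = 0, active = size - 1;
            while (active > 0) {
                double result;
                MPI_Status status;
                MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, TAG_REQUEST,
                         MPI_COMM_WORLD, &status);
                if (next_job < num_jobs) {
                    MPI_Send(&next_job, 1, MPI_INT, status.MPI_SOURCE, TAG_JOB,
                             MPI_COMM_WORLD);
                    next_job++;
                } else {
                    MPI_Send(&next_job, 1, MPI_INT, status.MPI_SOURCE, TAG_STOP,
                             MPI_COMM_WORLD);
                    active--;
                }
            }
        } else {                               /* worker */
            double result = 0.0;
            for (;;) {
                MPI_Status status;
                int job;
                MPI_Send(&result, 1, MPI_DOUBLE, 0, TAG_REQUEST, MPI_COMM_WORLD);
                MPI_Recv(&job, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
                if (status.MPI_TAG == TAG_STOP)
                    break;
                result = do_job(job);
            }
        }
        MPI_Finalize();
        return 0;
    }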
And the canonical non-trivial example of this pattern is colouring the Mandelbrot set. Computing each pixel of the final image is done completely independently of the other pixels, so it scales very well even on clusters with high-latency, slow network interconnects (e.g. GigE). In the extreme case each worker could compute a single pixel, but that would result in very high communication overhead, so it is better to split the image into small rectangles. One can find many ready-made MPI codes that colour the Mandelbrot set. For example, this code uses row decomposition, i.e. a single job item is to fill one row of the final image. If the number of MPI processes is big, the image dimensions have to be fairly large, otherwise the load won't balance well enough.
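For completeness, the per-pixel work in such a job is just the standard escape-time iteration (a sketch of mine, not taken from the linked code); in the row-decomposition scheme a worker would call something like this for every pixel of its assigned row:

    /* Return the number of iterations before z = z^2 + c escapes |z| > 2,
     * capped at max_iter; this value determines the pixel's colour. */
    static int mandel_iters(double cre, double cim, int max_iter)
    {
        double zre = 0.0, zim = 0.0;
        int i;
        for (i = 0; i < max_iter && zre * zre + zim * zim <= 4.0; i++) {
            double tmp = zre * zre - zim * zim + cre;
            zim = 2.0 * zre * zim + cim;
            zre = tmp;
        }
        return i;
    }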
MPI also has mechanisms that allow spawning additional processes or attaching externally started jobs in client/server fashion. Implementing them is not rocket science, but still requires some understanding of advanced MPI concepts like intercommunicators, so I would skip that for now.