The usual way to speed up an application is to parallelize it with MPI, or with higher-level libraries like PETSc that use MPI under the hood.
However, nowadays everyone seems interested in using CUDA to parallelize their application, or in a hybrid of MPI and CUDA for more ambitious/larger problems.
Is there any noticeable advantage to a hybrid MPI+CUDA programming model over the traditional, tried-and-tested MPI model of parallel programming? I am asking this specifically in the application domain of particle methods.
One reason I ask is that everywhere on the web I see the statement that "particle methods map naturally to the architecture of GPUs", or some variation of it. But they never seem to justify why I would be better off using CUDA than just MPI for the same job.
Which parallelising technique (OpenMP/MPI/CUDA) would you prefer? OpenMP is best known for shared-memory multiprocessing, MPI for message-passing between distributed processes, and CUDA for GPGPU computing, i.e. parallelising tasks on Nvidia GPUs.
Regular MPI implementations pass pointers to host memory, so GPU buffers must be staged through host memory using cudaMemcpy. With Kepler-class and later GPUs and Hyper-Q, multiple MPI processes can share the GPU.
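A minimal sketch of that staging step, assuming a non-CUDA-aware MPI (the function name, buffer size, and message tag are illustrative):

```cuda
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

// Sending a device buffer through a non-CUDA-aware MPI: the data must
// be staged into host memory before the pointer can be handed to MPI.
void send_device_buffer(const double *d_buf, int n, int dest)
{
    double *h_buf = (double *)malloc(n * sizeof(double));

    // Stage the GPU data into host memory first...
    cudaMemcpy(h_buf, d_buf, n * sizeof(double), cudaMemcpyDeviceToHost);

    // ...then pass the host pointer to MPI as usual.
    MPI_Send(h_buf, n, MPI_DOUBLE, dest, /*tag=*/0, MPI_COMM_WORLD);

    free(h_buf);
}
```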
MPI, the Message Passing Interface, is a standard API for communicating data via messages between distributed processes that is commonly used in HPC to build applications that can scale to multi-node computer clusters.
CUDA is a parallel computing platform and programming model for general computing on graphics processing units (GPUs). With CUDA, you can speed up applications by harnessing the power of GPUs.
A GPU with CUDA can give better efficiency than a cluster system with MPI, but only for suitable workloads. There are also libraries that combine CUDA with OpenMP or MPI. For beginners in parallel programming, OpenMP is the easiest and best starting point; CUDA is well suited to, and efficient for, large and complex problems. If you are looking for the best performance for your application, go hybrid, i.e. OpenMP + MPI + CUDA.
A user of a non-CUDA-aware MPI library could implement a more efficient pipeline using CUDA streams and asynchronous memory copies to speed up the communication. Even so, a CUDA-aware MPI can more efficiently exploit the underlying protocol and can automatically utilize the GPUDirect acceleration technologies.
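A hedged sketch of such a hand-rolled pipeline (the chunk count, the pinned staging buffer, and the assumption that the buffer divides evenly into chunks are all illustrative): the device-to-host copy of chunk i+1 is overlapped with the MPI_Send of chunk i.

```cuda
#include <mpi.h>
#include <cuda_runtime.h>

#define N_CHUNKS 4

// Pipelined send for a non-CUDA-aware MPI: split the device buffer
// into chunks and overlap the download of the next chunk with the
// MPI_Send of the current one.
void pipelined_send(const double *d_buf, int n, int dest)
{
    int chunk = n / N_CHUNKS;   // assume n divides evenly, for brevity
    double *h_buf;
    cudaStream_t stream;

    cudaMallocHost(&h_buf, n * sizeof(double)); // pinned memory, needed
    cudaStreamCreate(&stream);                  // for true async copies

    // Kick off the first asynchronous device-to-host copy.
    cudaMemcpyAsync(h_buf, d_buf, chunk * sizeof(double),
                    cudaMemcpyDeviceToHost, stream);

    for (int i = 0; i < N_CHUNKS; ++i) {
        cudaStreamSynchronize(stream);  // wait for chunk i to arrive

        // Start downloading chunk i+1 while chunk i goes over MPI.
        if (i + 1 < N_CHUNKS)
            cudaMemcpyAsync(h_buf + (i + 1) * chunk,
                            d_buf + (i + 1) * chunk,
                            chunk * sizeof(double),
                            cudaMemcpyDeviceToHost, stream);

        MPI_Send(h_buf + i * chunk, chunk, MPI_DOUBLE, dest, 0,
                 MPI_COMM_WORLD);
    }

    cudaStreamDestroy(stream);
    cudaFreeHost(h_buf);
}
```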
A CUDA-aware MPI implementation must handle buffers differently depending on whether they reside in host or device memory. An MPI implementation could offer different APIs for host and device buffers, or it could add an extra argument indicating where the passed buffer lives.
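In practice, CUDA-aware builds of libraries such as Open MPI or MVAPICH2 keep the standard API and detect device pointers internally via CUDA's unified virtual addressing, so the call site reduces to a minimal sketch like:

```cuda
#include <mpi.h>
#include <cuda_runtime.h>

// With a CUDA-aware MPI, the device pointer is passed to MPI directly;
// the library detects that d_buf lives in device memory (via unified
// virtual addressing) and may use GPUDirect under the hood. No staging
// copy appears in user code.
void send_device_buffer_cuda_aware(const double *d_buf, int n, int dest)
{
    MPI_Send((void *)d_buf, n, MPI_DOUBLE, dest, /*tag=*/0,
             MPI_COMM_WORLD);
}
```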
CUDA can be very fast, but only for certain kinds of applications, and data transfer in CUDA is often the bottleneck. MPI is suitable for cluster environments and large-scale networks of computers. OpenMP is more suitable for multi-core systems, so its speedup depends on the number of cores.
This is a bit apples and oranges.
MPI and CUDA are fundamentally different architectures. Most importantly, MPI lets you distribute your application over several nodes, while CUDA lets you use the GPU within the local node. If in an MPI program your parallel processes take too long to finish, then yes, you should look into how they could be sped up by using the GPU instead of the CPU to do their work. Conversely, if your CUDA application still takes too long to finish, you may want to distribute the work to multiple nodes using MPI.
The two technologies are pretty much orthogonal (assuming all the nodes on your cluster are CUDA-capable).
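To make the orthogonality concrete, here is a minimal hybrid skeleton, assuming an illustrative kernel and a round-robin rank-to-GPU mapping: MPI splits the problem across ranks, and each rank offloads its slice to a local GPU.

```cuda
#include <mpi.h>
#include <cuda_runtime.h>

// Illustrative kernel: scale each element of a vector by a constant.
__global__ void scale(double *x, int n, double a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nranks, ndev;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    // Map each rank to a GPU on its node (round-robin is one option).
    cudaGetDeviceCount(&ndev);
    cudaSetDevice(rank % ndev);

    // Each rank works on its own slice of the global problem.
    const int n = (1 << 22) / nranks;
    double *d_x;
    cudaMalloc(&d_x, n * sizeof(double));
    cudaMemset(d_x, 0, n * sizeof(double));

    scale<<<(n + 255) / 256, 256>>>(d_x, n, 2.0);
    cudaDeviceSynchronize();

    cudaFree(d_x);
    MPI_Finalize();
    return 0;
}
```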
Just to build on the other poster's already good answer, some high-level discussion of what kinds of problems GPUs are good at, and why.
GPUs have followed a dramatically different design path from CPUs, because of their distinct origins. Compared to CPU cores, GPU cores contain more ALUs and FP hardware and less control logic and cache. This means that GPUs can provide more efficiency for straight computations, but only code with regular control flow and smart memory access patterns will see the best benefit: up to over a TFLOPS for SP FP code.

GPUs are designed to be high-throughput, high-latency devices at both the control and memory levels. Globally accessible memory has a long, wide bus, so that coalesced (contiguous and aligned) memory accesses achieve good throughput despite long latency. Latencies are hidden by requiring massive thread-parallelism and providing essentially zero-overhead context switching by the hardware.

GPUs employ an SIMD-like model, SIMT, whereby groups of cores execute in SIMD lockstep (different groups being free to diverge), without forcing the programmer to reckon with this fact (except to achieve best performance: on Fermi, this could make a difference of up to 32x). SIMT lends itself to the data parallel programming model, whereby data independence is exploited to perform similar processing on a large array of data. Efforts are being made to generalize GPUs and their programming model, as well as to ease programming for good performance.
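A hypothetical kernel pair to illustrate the coalescing and SIMT-divergence points above: saxpy_good is the GPU-friendly pattern, where adjacent threads touch adjacent elements and all threads in a warp follow the same path; saxpy_divergent branches on thread parity, so each warp executes both sides of the branch in lockstep, masking off half its lanes each time.

```cuda
// GPU-friendly: thread i touches element i (coalesced access),
// and every thread in a warp takes the same branch.
__global__ void saxpy_good(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

// Divergent: threads in the same warp take different branches, so the
// warp executes both paths serially in SIMT lockstep.
__global__ void saxpy_divergent(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0)
        y[i] = a * x[i] + y[i];
    else
        y[i] = a * x[i] - y[i];
}
```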