The usual way to speed up an application is to parallelize it with MPI, or with higher-level libraries like PETSc that use MPI under the hood.
However, nowadays everyone seems interested in using CUDA to parallelize their application, or in a hybrid of MPI and CUDA for more ambitious/larger problems.
Is there any noticeable advantage to a hybrid MPI+CUDA programming model over the traditional, tried-and-tested MPI model of parallel programming? I am asking this specifically in the application domain of particle methods.
One reason I ask is that everywhere on the web I see the statement that "particle methods map naturally to the architecture of GPUs", or some variation of it. But they never seem to justify why I would be better off using CUDA than just MPI for the same job.
Which parallelising technique (OpenMP/MPI/CUDA) would you prefer? OpenMP is best known for shared-memory multiprocessing, MPI for message-passing between distributed processes, and CUDA for GPGPU computing, i.e. parallelising tasks on Nvidia GPUs.
Regular MPI implementations pass pointers to host memory, so GPU buffers must be staged through host memory using cudaMemcpy. With Kepler-class and later GPUs and Hyper-Q, multiple MPI processes can share the GPU.
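A minimal sketch of that staging step, assuming a non-CUDA-aware MPI (the function name, buffer size, and message tag are illustrative):

```cuda
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

// Sending a device buffer through a non-CUDA-aware MPI: the data must
// be staged into host memory before the pointer can be handed to MPI.
void send_device_buffer(const double *d_buf, int n, int dest)
{
    double *h_buf = (double *)malloc(n * sizeof(double));

    // Stage the GPU data into host memory first...
    cudaMemcpy(h_buf, d_buf, n * sizeof(double), cudaMemcpyDeviceToHost);

    // ...then pass the host pointer to MPI as usual.
    MPI_Send(h_buf, n, MPI_DOUBLE, dest, /*tag=*/0, MPI_COMM_WORLD);

    free(h_buf);
}
```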
MPI, the Message Passing Interface, is a standard API for communicating data via messages between distributed processes that is commonly used in HPC to build applications that can scale to multi-node computer clusters.
CUDA is a parallel computing platform and programming model for general computing on graphics processing units (GPUs). With CUDA, you can speed up applications by harnessing the power of GPUs.
A GPU with CUDA can give better efficiency than a cluster system with MPI, but only for suitable workloads. There are also libraries that combine CUDA with OpenMP or MPI. For beginners in parallel programming, OpenMP is the easiest and best starting point; CUDA is well suited to, and efficient for, large and complex problems. If you are looking for the best performance for your application, go hybrid, i.e. OpenMP + MPI + CUDA.
A user of a non-CUDA-aware MPI library could implement a more efficient pipeline using CUDA streams and asynchronous memory copies to speed up the communication. Even so, a CUDA-aware MPI can more efficiently exploit the underlying protocol and can automatically utilize the GPUDirect acceleration technologies.
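A hedged sketch of such a hand-rolled pipeline (the chunk count, the pinned staging buffer, and the assumption that the buffer divides evenly into chunks are all illustrative): the device-to-host copy of chunk i+1 is overlapped with the MPI_Send of chunk i.

```cuda
#include <mpi.h>
#include <cuda_runtime.h>

#define N_CHUNKS 4

// Pipelined send for a non-CUDA-aware MPI: split the device buffer
// into chunks and overlap the download of the next chunk with the
// MPI_Send of the current one.
void pipelined_send(const double *d_buf, int n, int dest)
{
    int chunk = n / N_CHUNKS;   // assume n divides evenly, for brevity
    double *h_buf;
    cudaStream_t stream;

    cudaMallocHost(&h_buf, n * sizeof(double)); // pinned memory, needed
    cudaStreamCreate(&stream);                  // for true async copies

    // Kick off the first asynchronous device-to-host copy.
    cudaMemcpyAsync(h_buf, d_buf, chunk * sizeof(double),
                    cudaMemcpyDeviceToHost, stream);

    for (int i = 0; i < N_CHUNKS; ++i) {
        cudaStreamSynchronize(stream);  // wait for chunk i to arrive

        // Start downloading chunk i+1 while chunk i goes over MPI.
        if (i + 1 < N_CHUNKS)
            cudaMemcpyAsync(h_buf + (i + 1) * chunk,
                            d_buf + (i + 1) * chunk,
                            chunk * sizeof(double),
                            cudaMemcpyDeviceToHost, stream);

        MPI_Send(h_buf + i * chunk, chunk, MPI_DOUBLE, dest, 0,
                 MPI_COMM_WORLD);
    }

    cudaStreamDestroy(stream);
    cudaFreeHost(h_buf);
}
```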
A CUDA-aware MPI implementation must handle buffers differently depending on whether they reside in host or device memory. An MPI implementation could offer different APIs for host and device buffers, or it could add an extra argument indicating where the passed buffer lives.
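In practice, CUDA-aware builds of libraries such as Open MPI or MVAPICH2 keep the standard API and detect device pointers internally via CUDA's unified virtual addressing, so the call site reduces to a minimal sketch like:

```cuda
#include <mpi.h>
#include <cuda_runtime.h>

// With a CUDA-aware MPI, the device pointer is passed to MPI directly;
// the library detects that d_buf lives in device memory (via unified
// virtual addressing) and may use GPUDirect under the hood. No staging
// copy appears in user code.
void send_device_buffer_cuda_aware(const double *d_buf, int n, int dest)
{
    MPI_Send((void *)d_buf, n, MPI_DOUBLE, dest, /*tag=*/0,
             MPI_COMM_WORLD);
}
```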
CUDA can be very fast, but only for certain kinds of applications, and data transfer in CUDA is often the bottleneck. MPI is suitable for cluster environments and large-scale networks of computers. OpenMP is more suitable for multi-core systems, so its speedup depends on the number of cores.
This is a bit apples and oranges.
MPI and CUDA are fundamentally different architectures. Most importantly, MPI lets you distribute your application over several nodes, while CUDA lets you use the GPU within the local node. If in an MPI program your parallel processes take too long to finish, then yes, you should look into how they could be sped up by using the GPU instead of the CPU to do their work. Conversely, if your CUDA application still takes too long to finish, you may want to distribute the work to multiple nodes using MPI.
The two technologies are pretty much orthogonal (assuming all the nodes on your cluster are CUDA-capable).
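To make the orthogonality concrete, here is a minimal hybrid skeleton, assuming an illustrative kernel and a round-robin rank-to-GPU mapping: MPI splits the problem across ranks, and each rank offloads its slice to a local GPU.

```cuda
#include <mpi.h>
#include <cuda_runtime.h>

// Illustrative kernel: scale each element of a vector by a constant.
__global__ void scale(double *x, int n, double a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nranks, ndev;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    // Map each rank to a GPU on its node (round-robin is one option).
    cudaGetDeviceCount(&ndev);
    cudaSetDevice(rank % ndev);

    // Each rank works on its own slice of the global problem.
    const int n = (1 << 22) / nranks;
    double *d_x;
    cudaMalloc(&d_x, n * sizeof(double));
    cudaMemset(d_x, 0, n * sizeof(double));

    scale<<<(n + 255) / 256, 256>>>(d_x, n, 2.0);
    cudaDeviceSynchronize();

    cudaFree(d_x);
    MPI_Finalize();
    return 0;
}
```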
Just to build on the other poster's already good answer, some high-level discussion of what kinds of problems GPUs are good at, and why.
GPUs have followed a dramatically different design path from CPUs, because of their distinct origins. Compared to CPU cores, GPU cores contain more ALUs and FP hardware and less control logic and cache. This means that GPUs can provide more efficiency for straight computations, but only code with regular control flow and smart memory access patterns will see the best benefit: up to over a TFLOPS for SP FP code.

GPUs are designed to be high-throughput, high-latency devices at both the control and memory levels. Globally accessible memory has a long, wide bus, so that coalesced (contiguous and aligned) memory accesses achieve good throughput despite long latency. Latencies are hidden by requiring massive thread-parallelism and providing essentially zero-overhead context switching by the hardware.

GPUs employ an SIMD-like model, SIMT, whereby groups of cores execute in SIMD lockstep (different groups being free to diverge), without forcing the programmer to reckon with this fact (except to achieve best performance: on Fermi, this could make a difference of up to 32x). SIMT lends itself to the data parallel programming model, whereby data independence is exploited to perform similar processing on a large array of data. Efforts are being made to generalize GPUs and their programming model, as well as to ease programming for good performance.
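A hypothetical kernel pair to illustrate the coalescing and SIMT-divergence points above: saxpy_good is the GPU-friendly pattern, where adjacent threads touch adjacent elements and all threads in a warp follow the same path; saxpy_divergent branches on thread parity, so each warp executes both sides of the branch in lockstep, masking off half its lanes each time.

```cuda
// GPU-friendly: thread i touches element i (coalesced access),
// and every thread in a warp takes the same branch.
__global__ void saxpy_good(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

// Divergent: threads in the same warp take different branches, so the
// warp executes both paths serially in SIMT lockstep.
__global__ void saxpy_divergent(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0)
        y[i] = a * x[i] + y[i];
    else
        y[i] = a * x[i] - y[i];
}
```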