 

MPI + GPU: how to mix the two techniques

Tags: gpu, mpi, hpc

My program is well-suited for MPI. Each CPU does its own specific (sophisticated) job, produces a single double, and then I use MPI_Reduce to multiply together the results from every CPU.
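For concreteness, here is a minimal sketch of the pattern I mean (compute_my_value is a hypothetical stand-in for each rank's real work):

    #include <mpi.h>
    #include <stdio.h>

    /* Hypothetical stand-in for each rank's specific job. */
    static double compute_my_value(int rank)
    {
        return 1.0 + 0.001 * rank;
    }

    int main(int argc, char **argv)
    {
        int rank;
        double local, product;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        local = compute_my_value(rank);
        /* Multiply every rank's double into one result on rank 0. */
        MPI_Reduce(&local, &product, 1, MPI_DOUBLE, MPI_PROD, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("product = %g\n", product);
        MPI_Finalize();
        return 0;
    }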

But I repeat this many, many times (> 100,000). Thus, it occurred to me that a GPU would dramatically speed things up.

I have Googled around but can't find anything concrete. How do you go about mixing MPI with GPUs? Is there a way for the program to query and verify "oh, this rank is the GPU, all others are CPUs"? Is there a recommended tutorial or something?

Importantly, I don't want or need a full set of GPUs. I really just need a lot of CPUs, and then a single GPU to speed up the frequently-used MPI_Reduce operation.

Here is a schematic example of what I'm talking about:

Suppose I have 500 CPUs. Each CPU somehow produces, say, 50 doubles. I need to multiply all 250,000 of these doubles together. Then I repeat this between 10,000 and 1 million times. If I could have one GPU (in addition to the 500 CPUs), this could be really efficient. Each CPU would compute its 50 doubles for all ~1 million "states". Then, all 500 CPUs would send their doubles to the GPU. The GPU would then multiply the 250,000 doubles together for each of the 1 million "states", producing 1 million doubles.
These numbers are not exact; the computation is indeed very large. I'm just trying to convey the general problem.
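Expressed schematically in plain CPU-side MPI (a sketch only; NUM_STATES, VALS_PER_RANK, and compute_values are illustrative placeholders), each rank folds its 50 doubles into one partial product per state, so a single MPI_Reduce combines all ranks for the whole batch and only one double per state per rank crosses the network:

    #include <mpi.h>
    #include <stdlib.h>

    #define NUM_STATES    1000000   /* illustrative batch of "states" */
    #define VALS_PER_RANK 50

    /* Hypothetical placeholder for the per-rank, per-state computation. */
    static void compute_values(int rank, int state, double *out)
    {
        int i;
        for (i = 0; i < VALS_PER_RANK; i++)
            out[i] = 1.0;  /* real work goes here */
    }

    int main(int argc, char **argv)
    {
        int rank, s, i;
        double vals[VALS_PER_RANK];
        double *partial = malloc(NUM_STATES * sizeof(double));
        double *result  = malloc(NUM_STATES * sizeof(double));

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (s = 0; s < NUM_STATES; s++) {
            compute_values(rank, s, vals);
            partial[s] = 1.0;
            for (i = 0; i < VALS_PER_RANK; i++)
                partial[s] *= vals[i];  /* local product of this rank's 50 doubles */
        }

        /* One collective multiplies the partial products across all ranks,
           leaving one final double per state on rank 0. */
        MPI_Reduce(partial, result, NUM_STATES, MPI_DOUBLE, MPI_PROD,
                   0, MPI_COMM_WORLD);

        free(partial);
        free(result);
        MPI_Finalize();
        return 0;
    }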

asked Apr 09 '12 by cmo



2 Answers

I found some relevant material on the topic, quoted from an NVIDIA developer blog post introducing CUDA-aware MPI:

"MPI, the Message Passing Interface, is a standard API for communicating data via messages between distributed processes that is commonly used in HPC to build applications that can scale to multi-node computer clusters. As such, MPI is fully compatible with CUDA, which is designed for parallel computing on a single computer or node. There are many reasons for wanting to combine the two parallel programming approaches of MPI and CUDA. A common reason is to enable solving problems with a data size too large to fit into the memory of a single GPU, or that would require an unreasonably long compute time on a single node. Another reason is to accelerate an existing MPI application with GPUs or to enable an existing single-node multi-GPU application to scale across multiple nodes. With CUDA-aware MPI these goals can be achieved easily and efficiently. In this post I will explain how CUDA-aware MPI works, why it is efficient, and how you can use it."

answered Sep 18 '22 by Leos313


This isn't the way to think about these things.

I like to say that MPI and GPGPU stuff are orthogonal(*). You use MPI between tasks (for which think nodes, although you can have multiple tasks per node), and each task may or may not use an accelerator like a GPU to accelerate the computation within a task. There is no MPI rank on a GPU.
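In practice, "each task may or may not use an accelerator" usually looks like the following sketch: every MPI task checks for GPUs on its node and, if any are present, claims one by local rank (CUDA runtime calls shown; MPI_Comm_split_type is MPI-3).

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Comm node_comm;
        int local_rank, num_gpus = 0;

        MPI_Init(&argc, &argv);

        /* Group the ranks sharing this node: GPU IDs are per-node. */
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &node_comm);
        MPI_Comm_rank(node_comm, &local_rank);

        cudaGetDeviceCount(&num_gpus);
        if (num_gpus > 0) {
            /* This task accelerates its own work with a GPU... */
            cudaSetDevice(local_rank % num_gpus);
        } else {
            /* ...and this one runs its plain CPU path. */
        }

        MPI_Comm_free(&node_comm);
        MPI_Finalize();
        return 0;
    }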

Regardless, Talonmies is right; this particular example doesn't sound like it would benefit much from a GPU. And it won't be helped by having tens of thousands of doubles per task; if you're only doing one or a few FLOPs per double, the cost of sending the data to the GPU will exceed the benefit of having all those cores operate on them.
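As a back-of-envelope estimate: 250,000 doubles is 2 MB, which at typical PCIe transfer rates of a few GB/s costs a few hundred microseconds, while multiplying 250,000 numbers together is roughly 250,000 floating-point operations, well under a millisecond on a single CPU core. The copy to the GPU costs about as much as simply doing the whole reduction on the host.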

(*) This used to be more clearly true; now with, for instance, GPUDirect being able to copy memory to remote GPUs over InfiniBand, the distinction is fuzzier. However, I maintain that this is still the most useful way to think about things, with such things as RDMA to GPUs being an important optimization but conceptually a minor tweak.

answered Sep 18 '22 by Jonathan Dursi