TensorFlow Horovod: NCCL and MPI

Tags: python, tensorflow, deep-learning, mpi

Horovod combines NCCL and MPI into a wrapper for distributed deep learning in, for example, TensorFlow. I hadn't heard of NCCL before and was looking into its functionality. The NVIDIA website states the following about NCCL:

The NVIDIA Collective Communications Library (NCCL) implements multi-GPU and multi-node collective communication primitives that are performance optimized for NVIDIA GPUs.

From the introduction video about NCCL, I understood that NCCL works over PCIe, NVLink, native InfiniBand, and Ethernet, and that it can even detect whether GPUDirect RDMA makes sense in the current hardware topology and use it transparently.

So I am wondering why MPI is needed in Horovod. As far as I understand, MPI is also used to efficiently exchange gradients among distributed nodes via an allreduce operation. But, as I understand it, NCCL already supports that functionality.

So is MPI only used to conveniently schedule jobs on a cluster? Or for distributed deep learning on CPUs, since NCCL cannot be used there?

I would highly appreciate it if someone could explain in which scenarios MPI and/or NCCL are used for distributed deep learning, and what their responsibilities are during a training job.

553

asked Nov 27 '18 11:11

Alex

2 Answers

MPI (Message Passing Interface) is a message-passing standard used in parallel computing (Wikipedia). Most of the time, you'd use Open MPI when using Horovod, which is an open-source implementation of the MPI standard.

An MPI implementation makes it easy to run more than one instance of a program in parallel. The program code stays the same; it simply runs in several different processes. In addition, the MPI library exposes an API for sharing data and information among these processes.

Horovod uses this mechanism to run several processes of the Python script that trains the neural network. These processes need to know and share some information while the network is training. Some of this information is about the environment, for example:

The number of processes currently running, so that parameters and hyperparameters of the neural network, such as the batch size and learning rate, can be adjusted correctly.
Which process is the "master" one, so that logs are printed and files (checkpoints) are saved from only a single process.
The ID (called "rank") of the current process, so that it can use a specific portion of the input data.
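
As a rough illustration of the three items above, here is a minimal pure-Python sketch of how a script might use the process count and rank. The helper names (`shard_for_rank`, `scaled_learning_rate`, `is_master`) are hypothetical, and the linear learning-rate scaling is just one common heuristic, assumed here. In an actual Horovod job, `rank` and `size` would come from `hvd.rank()` and `hvd.size()` after calling `hvd.init()`; here they are simulated with a loop.

```python
# Hypothetical helpers for illustration only; not part of Horovod or MPI.

def shard_for_rank(samples, rank, size):
    """Return the portion of the input data that this process should use."""
    # Strided slicing: rank 0 takes items 0, size, 2*size, ...; rank 1 takes 1, size+1, ...
    return samples[rank::size]

def scaled_learning_rate(base_lr, size):
    """Scale the learning rate with the number of processes (assumed heuristic)."""
    return base_lr * size

def is_master(rank):
    """Only one process (rank 0 by convention) prints logs and saves checkpoints."""
    return rank == 0

samples = list(range(10))
size = 4  # simulated number of processes

for rank in range(size):  # in a real job, each process sees only its own rank
    shard = shard_for_rank(samples, rank, size)
    role = "master (logs, checkpoints)" if is_master(rank) else "worker"
    print(f"rank {rank}: {role}, shard={shard}, lr={scaled_learning_rate(0.5, size)}")
```

Every sample ends up in exactly one shard, and only rank 0 takes the "master" role.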

Some of this information is about the training process of the neural network, for example:

The randomized initial values for the weights and biases of the model, so all processes will start from the same point.
The values of the weights and biases at the end of every training step, so all processes will start the next step with the same values.
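
The two items above can be sketched with a small pure-Python simulation. This is not Horovod's or NCCL's implementation: the functions `broadcast` and `allreduce_average` are hypothetical stand-ins that operate on a list holding one entry per simulated process. The sketch assumes that the weights are kept identical by averaging the gradients across processes before each update, which is one common way to do it.

```python
# Simulated collectives: each list entry stands for the data held by one process.

def broadcast(values_per_rank, root=0):
    """Every process ends up with a copy of the root process's values."""
    root_values = values_per_rank[root]
    return [list(root_values) for _ in values_per_rank]

def allreduce_average(values_per_rank):
    """Every process ends up with the element-wise average over all processes."""
    size = len(values_per_rank)
    averaged = [sum(column) / size for column in zip(*values_per_rank)]
    return [list(averaged) for _ in values_per_rank]

# Each process draws different random initial weights...
initial = [[0.1, 0.2], [0.7, -0.3], [0.5, 0.5]]
# ...so the root's weights are broadcast to give everyone the same starting point.
weights = broadcast(initial, root=0)

# Each process computes gradients on its own portion of the data.
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
avg_grads = allreduce_average(grads)

# Applying the same averaged gradient everywhere keeps the weights in sync.
lr = 0.5
weights = [[w - lr * g for w, g in zip(ws, gs)] for ws, gs in zip(weights, avg_grads)]

print(avg_grads[0])
print(all(w == weights[0] for w in weights))
```

After the update, every simulated process holds the same weights, so all of them start the next step from the same point.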

More information than this is shared; the items above are only some of it.

At first, Horovod used MPI for all of the requirements above. Then NVIDIA released NCCL, a library of algorithms for high-performance communication between GPUs. To improve overall performance, Horovod started using NCCL for the two training-related items above: sharing the initial weights and biases and, mainly, synchronizing them after every training step, since NCCL shares this data between GPUs much more efficiently.

The NVIDIA docs show that NCCL can be used in conjunction with MPI, and in general:

MPI is used for CPU-CPU communication, and NCCL is used for GPU-GPU communication.

Horovod still uses MPI to launch the instances of the Python script and to manage the environment (rank, size, which process is the "master", etc.), so that the user can easily manage the run.

187

answered Oct 02 '22 14:10

Raz Rotenberg

At first, Horovod used only MPI.

After NCCL was introduced to Horovod, MPI is still used, even in NCCL mode, to provide environment info (rank, size and local_rank). The NCCL docs have an example showing how NCCL leverages MPI in a one-device-per-process setting:

The following code is an example of a communicator creation in the context of MPI, using one device per MPI rank.

https://docs.nvidia.com/deeplearning/sdk/nccl-developer-guide/docs/examples.html#example-2-one-device-per-process-or-thread
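
To make the three environment values concrete, here is a minimal pure-Python sketch of how rank, size and local_rank relate in a one-device-per-process setting. It is only a simulation under stated assumptions: the `local_ranks` helper and the host names are hypothetical, and in a real job each process would learn the other processes' hosts through MPI rather than from a hard-coded list.

```python
def local_ranks(hostnames):
    """hostnames[r] is the host running global rank r.

    Returns, for every global rank, its index among the processes on the
    same host. With one device per process, this index can select the GPU.
    """
    seen = {}    # host -> number of lower-ranked processes already on it
    result = []
    for host in hostnames:
        result.append(seen.get(host, 0))
        seen[host] = seen.get(host, 0) + 1
    return result

# Hypothetical layout: 5 processes spread over 2 machines.
hostnames = ["node-a", "node-a", "node-b", "node-b", "node-b"]
size = len(hostnames)

for rank, local_rank in enumerate(local_ranks(hostnames)):
    print(f"rank={rank} size={size} host={hostnames[rank]} local_rank={local_rank}")
```

The global rank is unique across the whole job, while the local rank restarts at 0 on every machine, which is why it is the natural choice for picking a GPU.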

answered Oct 02 '22 14:10

eval
