How do PyTorch's parallel and distributed methods work?

I'm not an expert in distributed systems or CUDA, but there is one really interesting feature that PyTorch supports: nn.DataParallel and nn.DistributedDataParallel. How are they actually implemented? How do they handle shared embeddings and synchronize data?

Here is a basic example of DataParallel.

import numpy as np
import torch
import torch.nn as nn

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(1000, 10)
        self.rnn = nn.Linear(10, 10)

    def forward(self, x):
        x = self.embedding(x)
        x = self.rnn(x)
        return x

# Wrap the model so the batch dimension is split across all visible GPUs.
model = nn.DataParallel(Model()).cuda()
inputs = torch.from_numpy(np.array([1, 2, 3, 4, 5, 6], dtype=np.int64)).cuda()
output = model(inputs).cpu()

PyTorch can split the input, send the chunks to several GPUs, and merge the results back.

How does it manage embeddings and synchronization for a parallel model or a distributed model?
I wandered around PyTorch's code, but it's very hard to figure out how the fundamentals work.

asked Nov 19 '18 by erickg


People also ask

How does PyTorch distributed Data Parallel work?

DistributedDataParallel (DDP) implements data parallelism at the module level and can run across multiple machines. Applications using DDP should spawn multiple processes and create a single DDP instance per process. DDP uses collective communications in the torch.distributed package to synchronize gradients and buffers.

How does PyTorch distributed training work?

With DDP, the model is replicated on every process, and every model replica will be fed with a different set of input data samples. DDP takes care of gradient communication to keep model replicas synchronized and overlaps it with the gradient computations to speed up training.

How does torch distributed work?

The distributed package included in PyTorch (i.e., torch.distributed) enables researchers and practitioners to easily parallelize their computations across processes and clusters of machines. To do so, it leverages message-passing semantics, allowing each process to communicate data to any of the other processes.
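
As a rough illustration of that message-passing idea (not PyTorch internals, just a minimal two-process sketch using the public torch.distributed API with a hypothetical local rendezvous address):

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    # Hypothetical single-machine rendezvous; real jobs usually use env:// or a launcher.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    tensor = torch.zeros(1)
    if rank == 0:
        tensor += 42
        dist.send(tensor, dst=1)   # process 0 sends its data to process 1
    else:
        dist.recv(tensor, src=0)   # process 1 blocks until the data arrives
    print(f"rank {rank} has {tensor.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)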

What is distributed data parallel in PyTorch?

Distributed data parallel is multi-process and works for both single-machine and multi-machine training. In PyTorch, nn.parallel.DistributedDataParallel parallelizes the module by splitting the input across the specified devices. This module is suitable for multi-node, multi-GPU training as well.

How do I use parallelism in PyTorch?

If we quickly want to get started with distributed training, we can use DataParallel in PyTorch, which uses threading to achieve parallel training. All we need to do is add one line (wrapping the model in nn.DataParallel) to the script, and PyTorch will handle the parallelism for us.

What is DDP in PyTorch and how does it work?

DDP in PyTorch does the same thing, but in a much more efficient way, and also gives us better control while achieving parallelism. DDP uses multiprocessing instead of threading and runs propagation through the model in a separate process for each GPU. DDP duplicates the model across multiple GPUs, each of which is controlled by one process.

How to use multiple GPUs with PyTorch Dataparallel?

In this tutorial, we will learn how to use multiple GPUs with DataParallel. It's very easy to use GPUs with PyTorch: you put the model on a GPU and then copy your tensors to the GPU (see the sketch below). Please note that just calling my_tensor.to(device) returns a new copy of my_tensor on the GPU instead of overwriting my_tensor.
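
For completeness, a minimal sketch of that pattern; nn.Linear here is just a stand-in for any model, and the snippet falls back to the CPU if no GPU is available:

import torch
import torch.nn as nn

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = nn.Linear(10, 2)              # stand-in for any nn.Module
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)    # one line: replicate across all visible GPUs
model.to(device)                      # put the (wrapped) model on the GPU

my_tensor = torch.randn(8, 10)
my_tensor = my_tensor.to(device)      # .to() returns a new copy; reassign it
output = model(my_tensor)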


1 Answer

That's a great question.
PyTorch's DataParallel paradigm is actually quite simple and the implementation is open-sourced here. Note that this paradigm is not recommended today, as it bottlenecks at the master GPU and is not efficient in data transfer.

This container parallelizes the application of the given module by splitting the input across the specified devices by chunking in the batch dimension (other objects will be copied once per device). In the forward pass, the module is replicated on each device, and each replica handles a portion of the input. During the backwards pass, gradients from each replica are summed into the original module.
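
In other words, a forward pass through DataParallel boils down to roughly the following primitives. This is a simplified sketch using the public torch.nn.parallel helpers, not the actual implementation:

import torch
import torch.nn as nn

def data_parallel_forward(module, inputs, device_ids, output_device=None):
    if output_device is None:
        output_device = device_ids[0]
    # 1. Split the batch along dim 0 into one chunk per GPU.
    scattered = nn.parallel.scatter(inputs, device_ids)
    # 2. Copy the module (parameters and buffers) onto every device.
    replicas = nn.parallel.replicate(module, device_ids)[:len(scattered)]
    # 3. Run each replica on its own chunk, in parallel.
    outputs = nn.parallel.parallel_apply(replicas, scattered)
    # 4. Concatenate the per-GPU outputs back on the output device.
    return nn.parallel.gather(outputs, output_device)

# e.g. data_parallel_forward(Model().cuda(), inputs.cuda(), device_ids=[0, 1])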

As for DistributedDataParallel, that's more tricky. This is currently the more advanced approach, and it is quite efficient (see here).

This container parallelizes the application of the given module by splitting the input across the specified devices by chunking in the batch dimension. The module is replicated on each machine and each device, and each such replica handles a portion of the input. During the backwards pass, gradients from each node are averaged.
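
For comparison, the usual DDP recipe is one process per GPU, each process holding its own replica, with DDP averaging the gradients behind the scenes. A minimal single-machine sketch (NCCL backend and a hypothetical local rendezvous address assumed):

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def train(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"   # hypothetical single-machine rendezvous
    os.environ["MASTER_PORT"] = "29501"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    model = nn.Linear(10, 10).to(rank)         # one replica per process/GPU
    ddp_model = DDP(model, device_ids=[rank])  # gradients are all-reduced for us
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    inputs = torch.randn(20, 10).to(rank)      # each rank feeds its own shard of data
    loss = ddp_model(inputs).sum()
    loss.backward()                            # gradient all-reduce overlaps with this
    optimizer.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(train, args=(world_size,), nprocs=world_size)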

There are several approaches to averaging the gradients from each node. I would recommend this paper [1] to get a real sense of how things work. Generally speaking, transferring data from one GPU to another costs bandwidth and time, and we want that part to be really efficient. So one possible approach is to connect each pair of GPUs with a really fast link in a ring, and to pass only a slice of the gradients from one to the next, so that in total we transfer less data, more efficiently, and every node ends up with all the gradients (or at least their average). There will still be a master GPU, or at least a master process, in that setup, but now no single GPU is a bottleneck; they all carry roughly the same amount of traffic.
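
To make the ring idea concrete, here is a toy NumPy simulation of ring all-reduce (purely illustrative, no real GPUs or communication): each "node" splits its gradient into N chunks, a reduce-scatter pass leaves each node with one fully summed chunk, and an all-gather pass circulates those summed chunks, so every node transfers only about 2·(N-1)/N of the gradient instead of sending the whole thing to everyone.

import numpy as np

def ring_allreduce(chunks_per_node):
    """Toy simulation: chunks_per_node[i][j] is node i's j-th gradient chunk.
    After the call, every node holds the element-wise sum over all nodes."""
    n = len(chunks_per_node)  # number of nodes == number of chunks per node
    # Reduce-scatter: after n-1 steps, node i owns the fully summed chunk (i + 1) % n.
    for step in range(n - 1):
        sends = [((i - step) % n, chunks_per_node[i][(i - step) % n].copy()) for i in range(n)]
        for i, (idx, data) in enumerate(sends):
            chunks_per_node[(i + 1) % n][idx] += data   # pass one chunk to the right neighbour
    # All-gather: circulate the summed chunks so every node ends up with all of them.
    for step in range(n - 1):
        sends = [((i + 1 - step) % n, chunks_per_node[i][(i + 1 - step) % n].copy()) for i in range(n)]
        for i, (idx, data) in enumerate(sends):
            chunks_per_node[(i + 1) % n][idx] = data

if __name__ == "__main__":
    n = 4
    rng = np.random.default_rng(0)
    grads = [rng.standard_normal(8) for _ in range(n)]   # each "GPU" holds an 8-element gradient
    expected = sum(grads)
    chunks = [list(np.array_split(g.copy(), n)) for g in grads]
    ring_allreduce(chunks)
    assert all(np.allclose(np.concatenate(c), expected) for c in chunks)
    print("every node now holds the summed gradient")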

Now this can be further optimized if we don't wait for all the batches to finish computing and instead use a time-sharing scheme where each node sends its portion as soon as it is ready. Don't hold me to the details, but it turns out that if we don't wait for everything to finish and do the averaging as soon as we can, it may also speed up the gradient averaging.
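
That is essentially what asynchronous collectives make possible. A hedged sketch of the idea with torch.distributed, assuming a process group is already initialized (this is not what DDP literally does internally; DDP buckets gradients and hooks into autograd, but the principle is the same):

import torch
import torch.distributed as dist

def average_gradients_async(model):
    """Launch one non-blocking all-reduce per parameter so communication can
    overlap with whatever work is still running; wait for all handles at the end."""
    world_size = dist.get_world_size()
    handles = []
    for param in model.parameters():
        if param.grad is not None:
            # async_op=True returns immediately with a work handle.
            handles.append((param, dist.all_reduce(param.grad, op=dist.ReduceOp.SUM, async_op=True)))
    for param, handle in handles:
        handle.wait()                 # communication finishes here
        param.grad /= world_size      # turn the sum into an average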

Please refer to the literature for more information; this area is still developing (as of today).

PS 1: Usually such distributed training works better on machines that are set up for the task, e.g. AWS deep learning instances that implement those protocols in hardware.

PS 2: Disclaimer: I really don't know which protocol the PyTorch devs chose to implement or why. I work with distributed training and prefer to follow PyTorch's best practices without trying to outsmart them. I recommend you do the same unless you are really into researching this area.

References:

[1] Distributed Training of Deep Learning Models: A Taxonomic Perspective

answered Oct 17 '22 by mr_mo