Run multiple models of an ensemble in parallel with PyTorch

My neural network has the following architecture:

input -> 128x (separate fully connected layers) -> output averaging

I am using a ModuleList to hold the list of fully connected layers. Here's how it looks at this point:

import torch
import torch.nn as nn
import torch.optim as optim

class MultiHead(nn.Module):
    def __init__(self, dim_state, dim_action, hidden_size=32, nb_heads=1):
        super(MultiHead, self).__init__()

        # one independent fully connected network per head
        self.networks = nn.ModuleList()
        for _ in range(nb_heads):
            network = nn.Sequential(
                nn.Linear(dim_state, hidden_size),
                nn.Tanh(),
                nn.Linear(hidden_size, dim_action)
            )
            self.networks.append(network)

        self.cuda()
        self.optimizer = optim.Adam(self.parameters())

Then, when I need to calculate the output, I use a for ... in construct to perform the forward and backward pass through all the layers:

q_values = torch.cat([net(observations) for net in self.networks])

# skipped code which ultimately computes the loss I need

self.optimizer.zero_grad()
loss.backward()
self.optimizer.step()

This works! But I am wondering if I couldn't do this more efficiently. I feel like by doing a for...in, I am actually going through each separate FC layer one by one, while I'd expect this operation could be done in parallel.

asked Oct 14 '19 by MasterScrat

People also ask

How do I use multiple GPUs in parallel in PyTorch?

To use data parallelism with PyTorch, you can use the DataParallel class. You define your GPU IDs and wrap your network (an nn.Module object) in a DataParallel object. Then, when you call the wrapped model, each input batch is split into chunks that are distributed across the defined GPUs.
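
For illustration, here is a minimal sketch of nn.DataParallel usage (the toy nn.Linear model, the sizes, and the assumption of at least two visible GPUs are mine):

import torch
import torch.nn as nn

model = nn.Linear(16, 4)
model = nn.DataParallel(model, device_ids=[0, 1]).cuda()  # replicas live on GPU 0 and GPU 1

x = torch.randn(64, 16).cuda()
y = model(x)    # the batch is split across the listed GPUs and the outputs are gathered back
print(y.shape)  # torch.Size([64, 4])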

How does PyTorch parallel work?

Distributed Data Parallel (DDP) in PyTorch uses multiprocessing instead of threading and runs the forward and backward passes in a separate process for each GPU. DDP replicates the model across multiple GPUs, each of which is controlled by one process; a process here is essentially one copy of the script running on your system.
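
As a rough sketch of this, with one process per GPU (the "nccl" backend, the port number, and the toy model are assumptions; "gloo" would be needed on CPU-only machines):

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    model = nn.Linear(16, 4).to(rank)          # each process holds its own replica on its GPU
    ddp_model = DDP(model, device_ids=[rank])  # gradients are all-reduced across processes
    loss = ddp_model(torch.randn(8, 16, device=rank)).sum()
    loss.backward()
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size)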

Is PyTorch parallel?

Data parallelism can be achieved in pure PyTorch by using the DataParallel layer. The PyTorch documentation describes this class in its data-parallelism tutorial. DataParallel can be applied to any PyTorch module (any nn.Module object, which can range from "a layer" to "a model").

What is model parallelism?

In model parallelism, the model is partitioned into N parts, where N is the number of GPUs (analogous to the way the data is split N ways in data parallelism). Each part is then placed on an individual GPU. The parts are executed sequentially, starting with GPU#0, then GPU#1, and continuing until GPU#N.
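
A toy sketch of this idea, assuming exactly two GPUs and an arbitrary two-layer model chosen only for illustration:

import torch
import torch.nn as nn

class TwoGPUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.part1 = nn.Linear(128, 64).to("cuda:0")  # first part of the model on GPU 0
        self.part2 = nn.Linear(64, 10).to("cuda:1")   # second part on GPU 1

    def forward(self, x):
        x = torch.tanh(self.part1(x.to("cuda:0")))
        return self.part2(x.to("cuda:1"))             # activations are moved between devices

model = TwoGPUNet()
out = model(torch.randn(32, 128))                     # output lives on cuda:1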

How can we achieve model parallelism with PyTorch?

We would be using the RoBERTa-Large model from Hugging Face Transformers. This approach helps achieve model parallelism with just PyTorch, without using any PyTorch wrappers such as PyTorch Lightning. Lastly, we would be using the IMDB dataset of 50K movie reviews for fine-tuning our RoBERTa-Large model.

Is it possible to perform ensemble inference on multiple GPUs?

I hope to perform ensemble inference on the same validation data on multiple GPUs (i.e. 4 GPUs). Originally, there was some data parallelism in this framework, and if I just used a single model for inference, it worked well, with utilization above 85% on all 4 GPUs.

How many cores does PyTorch use for neural net?

The fact is that for PyTorch 1.3.0 and 1.3.1, the basic neural net functions (model evaluation, backward differentiation, optimization stepping) are all optimized to use all available cores. So even with a single process thread, all 16 cores divide the work.
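
You can inspect and cap this intra-op CPU parallelism directly (a small sketch; the thread count of 4 is arbitrary):

import torch

print(torch.get_num_threads())  # number of intra-op threads, typically the number of cores
torch.set_num_threads(4)        # limit CPU parallelism, e.g. on a shared machine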

What is model parallelism and how do I implement it?

This tutorial will help you implement Model Parallelism (splitting the model layers into multiple GPUs) to help train larger models over multiple GPUs. We would be using the RoBERTa-Large model from Hugging Face Transformers.


1 Answer

In the case of Conv1d/2d/3d in place of Linear, you can use the groups argument for "grouped convolutions" (depthwise convolutions being the special case where groups equals the number of input channels). This lets you handle all the parallel networks simultaneously.

If you use a convolution kernel of size 1, then the convolution is equivalent to applying a Linear layer, where each channel is treated as an input dimension. So the rough structure of your network would look like this:

  1. Reshape the input tensor of shape B x dim_state: add a channel dimension and replicate it nb_heads times, going from B x dim_state to B x (dim_state * nb_heads) x 1.
  2. Replace the two Linear layers with

nn.Conv1d(in_channels=dim_state * nb_heads, out_channels=hidden_size * nb_heads, kernel_size=1, groups=nb_heads)

and

nn.Conv1d(in_channels=hidden_size * nb_heads, out_channels=dim_action * nb_heads, kernel_size=1, groups=nb_heads)

  3. You now have a tensor of size B x (dim_action * nb_heads) x 1, which you can reshape to whatever layout you want (e.g. B x nb_heads x dim_action); see the sketch below.
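
Putting these steps together, here is a minimal sketch of the grouped-convolution version (the class name MultiHeadGrouped and the exact reshaping details are my own, so treat it as an illustration rather than a verified drop-in replacement):

import torch
import torch.nn as nn

class MultiHeadGrouped(nn.Module):
    def __init__(self, dim_state, dim_action, hidden_size=32, nb_heads=128):
        super().__init__()
        self.nb_heads = nb_heads
        self.dim_action = dim_action
        # groups=nb_heads keeps the nb_heads sub-networks independent of each other
        self.net = nn.Sequential(
            nn.Conv1d(dim_state * nb_heads, hidden_size * nb_heads,
                      kernel_size=1, groups=nb_heads),
            nn.Tanh(),
            nn.Conv1d(hidden_size * nb_heads, dim_action * nb_heads,
                      kernel_size=1, groups=nb_heads),
        )

    def forward(self, observations):
        # observations: B x dim_state
        x = observations.repeat(1, self.nb_heads).unsqueeze(-1)  # B x (dim_state * nb_heads) x 1
        q = self.net(x)                                          # B x (dim_action * nb_heads) x 1
        return q.view(-1, self.nb_heads, self.dim_action)        # B x nb_heads x dim_action

model = MultiHeadGrouped(dim_state=8, dim_action=4, nb_heads=128).cuda()
q_values = model(torch.randn(32, 8).cuda())  # all 128 heads evaluated in a single call

Compared to the torch.cat version, the output here is B x nb_heads x dim_action, so averaging over heads becomes q_values.mean(dim=1).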

While CUDA natively supports grouped convolutions, there were some issues in PyTorch with the speed of grouped convolutions (see e.g. here), but I think that has since been resolved.

answered Oct 29 '22 by flawr