My neural network has the following architecture:
input -> 128x (separate fully connected layers) -> output averaging
I am using a ModuleList to hold the list of fully connected layers. Here's how it looks at this point:
import torch
import torch.nn as nn
import torch.optim as optim

class MultiHead(nn.Module):
    def __init__(self, dim_state, dim_action, hidden_size=32, nb_heads=1):
        super(MultiHead, self).__init__()
        # One small, independent two-layer network per head
        self.networks = nn.ModuleList()
        for _ in range(nb_heads):
            network = nn.Sequential(
                nn.Linear(dim_state, hidden_size),
                nn.Tanh(),
                nn.Linear(hidden_size, dim_action)
            )
            self.networks.append(network)
        self.cuda()
        self.optimizer = optim.Adam(self.parameters())
Then, when I need to calculate the output, I use a for ... in construct to perform the forward and backward pass through all the layers:
q_values = torch.cat([net(observations) for net in self.networks])
# skipped code which ultimately computes the loss I need
self.optimizer.zero_grad()
loss.backward()
self.optimizer.step()
This works! But I am wondering whether I could do this more efficiently. I feel like by doing a for ... in, I am actually going through each separate FC layer one by one, while I'd expect this operation could be done in parallel.
If you use ConvNd layers in place of Linear, you can use the groups argument to get "grouped convolutions" (of which "depthwise convolutions" are the special case where groups equals the number of input channels). This lets you handle all parallel networks simultaneously.
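To make the effect of groups concrete, here is a minimal sketch (the sizes are illustrative and not taken from the question). It shows that a grouped nn.Conv1d stores one independent kernel per group, and that changing the input channels of one group does not affect the output channels of another:

```python
import torch
import torch.nn as nn

# Illustrative sizes (assumptions, not from the question)
nb_heads, dim_in, dim_out = 4, 3, 5

# One grouped convolution that holds nb_heads independent kernels
conv = nn.Conv1d(
    in_channels=dim_in * nb_heads,
    out_channels=dim_out * nb_heads,
    kernel_size=1,
    groups=nb_heads,
)

# Each output channel only sees the dim_in input channels of its own group,
# so the weight has shape (out_channels, in_channels / groups, kernel_size).
weight_shape = tuple(conv.weight.shape)
print(weight_shape)

# Perturb only the input channels that belong to group 1.
x = torch.randn(2, dim_in * nb_heads, 1)
x_changed = x.clone()
x_changed[:, dim_in:2 * dim_in] += 1.0

out, out_changed = conv(x), conv(x_changed)

# Group 0 outputs are unaffected; group 1 outputs change.
group0_unchanged = torch.allclose(out[:, :dim_out], out_changed[:, :dim_out])
group1_changed = not torch.allclose(
    out[:, dim_out:2 * dim_out], out_changed[:, dim_out:2 * dim_out]
)
print(group0_unchanged, group1_changed)
```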
If you use a convolution kernel of size 1, then the convolution does nothing other than apply a Linear layer, where each channel is considered an input dimension. So the rough structure of your network would look like this:
1. Modify the input tensor of shape B x dim_state as follows: add an additional dimension and replicate it nb_heads times, going from B x dim_state to B x (dim_state * nb_heads) x 1.
2. Replace the two Linear layers with
nn.Conv1d(in_channels=dim_state * nb_heads, out_channels=hidden_size * nb_heads, kernel_size=1, groups=nb_heads)
and
nn.Conv1d(in_channels=hidden_size * nb_heads, out_channels=dim_action * nb_heads, kernel_size=1, groups=nb_heads)
3. The output is now a tensor of size B x (dim_action * nb_heads) x 1, which you can reshape to whatever shape you want (e.g. B x nb_heads x dim_action).

While CUDA natively supports grouped convolutions, there were some issues in PyTorch with the speed of grouped convolutions (see e.g. here), but I think that has been solved by now.
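The steps above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a drop-in replacement: the class name GroupedMultiHead and the small sizes are hypothetical, and the weights are copied from a ModuleList of per-head networks only to check that both versions compute the same values.

```python
import torch
import torch.nn as nn


class GroupedMultiHead(nn.Module):
    """Hypothetical variant: all heads evaluated at once via grouped 1x1 convolutions."""

    def __init__(self, dim_state, dim_action, hidden_size=32, nb_heads=1):
        super().__init__()
        self.nb_heads = nb_heads
        self.dim_action = dim_action
        self.net = nn.Sequential(
            nn.Conv1d(dim_state * nb_heads, hidden_size * nb_heads,
                      kernel_size=1, groups=nb_heads),
            nn.Tanh(),
            nn.Conv1d(hidden_size * nb_heads, dim_action * nb_heads,
                      kernel_size=1, groups=nb_heads),
        )

    def forward(self, observations):
        # B x dim_state -> B x (dim_state * nb_heads) x 1
        x = observations.repeat(1, self.nb_heads).unsqueeze(-1)
        out = self.net(x)  # B x (dim_action * nb_heads) x 1
        # B x nb_heads x dim_action
        return out.reshape(-1, self.nb_heads, self.dim_action)


# Sanity check against the loop over separate heads (illustrative sizes).
torch.manual_seed(0)
dim_state, dim_action, hidden_size, nb_heads = 6, 3, 8, 4

heads = nn.ModuleList([
    nn.Sequential(
        nn.Linear(dim_state, hidden_size),
        nn.Tanh(),
        nn.Linear(hidden_size, dim_action),
    )
    for _ in range(nb_heads)
])
grouped = GroupedMultiHead(dim_state, dim_action, hidden_size, nb_heads)

# Copy each head's Linear weights into the matching group of the convolutions.
with torch.no_grad():
    for layer_idx in (0, 2):
        w = torch.cat([h[layer_idx].weight for h in heads])  # (out * nb_heads, in)
        b = torch.cat([h[layer_idx].bias for h in heads])    # (out * nb_heads,)
        grouped.net[layer_idx].weight.copy_(w.unsqueeze(-1))
        grouped.net[layer_idx].bias.copy_(b)

observations = torch.randn(5, dim_state)
looped = torch.stack([h(observations) for h in heads], dim=1)  # B x nb_heads x dim_action
batched = grouped(observations)
max_abs_diff = (looped - batched).abs().max().item()
print(batched.shape, max_abs_diff)
```

If the head outputs should be averaged, as in the architecture described in the question, something like batched.mean(dim=1) would give a B x dim_action tensor. Whether this is actually faster than the loop depends on the PyTorch version and hardware, so it is worth timing both.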