I'm currently working on a system with two 12GB GPUs and I want to implement model parallelism across them to train large models. I have searched all over the internet, SO, the TensorFlow documentation, etc.; I was able to find explanations of model parallelism and its results, but nowhere did I find a small tutorial or small code snippets on how to implement it in TensorFlow. I mean, we have to exchange activations after every layer, right? So how do we do that? Is there a specific or cleaner way of implementing model parallelism in TensorFlow? It would be very helpful if you could point me to a place where I can learn to implement it, or to simple code like MNIST training on multiple GPUs using model parallelism.
Note: I have done data parallelism as in the CIFAR-10 multi-GPU tutorial, but I haven't found any implementation of model parallelism.
If you just want data-parallel training (batch splitting), then you do not need Mesh TensorFlow, though Mesh TensorFlow can do it. The most common reason for more sophisticated parallel computation is that the parameters of the model do not fit on one device, e.g. a 5-billion-parameter language model.
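For reference, the Mesh TensorFlow README expresses a model in terms of named tensor dimensions and then shards those dimensions across devices via layout rules. A simplified sketch along those lines (adapted from the README, so exact API details may vary between versions; tf_images is an assumed TF tensor holding the input batch):

import mesh_tensorflow as mtf

graph = mtf.Graph()
mesh = mtf.Mesh(graph, "my_mesh")

# Every dimension gets a name; sharding is chosen by layout rules, not by code.
batch_dim = mtf.Dimension("batch", 100)
io_dim = mtf.Dimension("io", 784)
hidden_dim = mtf.Dimension("hidden", 1024)

x = mtf.import_tf_tensor(mesh, tf_images, shape=mtf.Shape([batch_dim, io_dim]))
w = mtf.get_variable(mesh, "w", mtf.Shape([io_dim, hidden_dim]))
hidden = mtf.relu(mtf.einsum([x, w], output_shape=mtf.Shape([batch_dim, hidden_dim])))

# Shard the "hidden" dimension across both GPUs: each device holds half of w.
mesh_shape = mtf.convert_to_shape("all:2")
layout_rules = mtf.convert_to_layout_rules("hidden:all")

Lowering this mesh graph onto your two GPUs (via a placement mesh implementation and mtf.Lowering) then yields ordinary TensorFlow ops.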
In tensor parallelism, each GPU processes only a slice of a tensor and aggregates the full tensor only for operations that require the whole thing. This section uses concepts and diagrams from the Megatron-LM paper: Efficient Large-Scale Language Model Training on GPU Clusters.
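To make that concrete in plain TensorFlow (a minimal hand-rolled sketch, not Megatron-LM itself; the layer sizes are arbitrary): split a weight matrix column-wise across the two GPUs, let each GPU multiply its slice locally, and concatenate only where the full activation is needed.

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 512])  # input activations, visible to both GPUs

# Column-parallel linear layer: each GPU owns half of the output columns.
with tf.device("/gpu:0"):
    w0 = tf.Variable(tf.random_normal([512, 128]))  # first half of the weight matrix
    y0 = tf.matmul(x, w0)
with tf.device("/gpu:1"):
    w1 = tf.Variable(tf.random_normal([512, 128]))  # second half
    y1 = tf.matmul(x, w1)

# Aggregate the slices only for ops that need the whole tensor.
with tf.device("/gpu:0"):
    y = tf.concat([y0, y1], axis=1)  # full [None, 256] activation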
In modern machine learning, the various approaches to parallelism are used to:
- fit very large models onto limited hardware, e.g. t5-11b is 45GB in just model params
- significantly speed up training, e.g. finish training that would take a year in hours
It is similar to tensor model parallelism or naive layer-wise model parallelism. The significance of this framework is that it takes resources like (1) GPU/TPU/CPU vs. (2) RAM/DRAM vs. (3) fast intra-connect/slow inter-connect, and it automatically optimizes over all of these, algorithmically deciding which parallelization to use where.
Here's an example. The model has some parts on GPU 0, some parts on GPU 1, and some parts on the CPU, so this is 3-way model parallelism.
import tensorflow as tf

# Each variable (and the op that consumes it) is pinned to a different device.
with tf.device("/gpu:0"):
    a = tf.Variable(tf.ones(()))
    a = tf.square(a)
with tf.device("/gpu:1"):
    b = tf.Variable(tf.ones(()))
    b = tf.square(b)
with tf.device("/cpu:0"):
    # TensorFlow copies a and b over to the CPU automatically to compute the sum.
    loss = a + b

opt = tf.train.GradientDescentOptimizer(learning_rate=0.1)
train_op = opt.minimize(loss)

sess = tf.Session()
sess.run(tf.global_variables_initializer())
for i in range(10):
    loss0, _ = sess.run([loss, train_op])
    print("loss", loss0)
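As for exchanging activations after every layer: you do not have to move them yourself. When an op on one device consumes a tensor produced on another, TensorFlow inserts the transfer automatically, so a layer-wise split is just a matter of pinning each layer to a device. A minimal MNIST-style sketch in the same TF1 style (layer sizes and variable names are illustrative):

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 784])   # flattened MNIST images
y_true = tf.placeholder(tf.int64, [None])     # integer class labels

# First layer lives on GPU 0.
with tf.device("/gpu:0"):
    w1 = tf.Variable(tf.truncated_normal([784, 256], stddev=0.1))
    b1 = tf.Variable(tf.zeros([256]))
    h1 = tf.nn.relu(tf.matmul(x, w1) + b1)

# Second layer lives on GPU 1; h1 is copied across devices automatically.
with tf.device("/gpu:1"):
    w2 = tf.Variable(tf.truncated_normal([256, 10], stddev=0.1))
    b2 = tf.Variable(tf.zeros([10]))
    logits = tf.matmul(h1, w2) + b2
    loss = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y_true, logits=logits))

train_op = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

# log_device_placement shows where each op ran; allow_soft_placement falls back
# to an available device if a requested one is missing.
sess = tf.Session(config=tf.ConfigProto(allow_soft_placement=True,
                                        log_device_placement=True))
sess.run(tf.global_variables_initializer())

During the backward pass, the gradients cross the device boundary in the reverse direction, again with automatic copies.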