Implementation of model parallelism in TensorFlow

I'm currently working on a system with two GPUs, each with 12 GB of memory, and I want to implement model parallelism across them to train large models. I have looked all over the internet, SO, the TensorFlow documentation, etc. I was able to find explanations of model parallelism and its results, but nowhere did I find a small tutorial or small code snippets on how to implement it in TensorFlow. I mean, we have to exchange activations after every layer, right? So how do we do that? Is there a specific or cleaner way of implementing model parallelism in TensorFlow? It would be very helpful if you could suggest a place where I can learn to implement it, or simple code like MNIST training on multiple GPUs using MODEL PARALLELISM.

Note: I have done data parallelism as in the CIFAR-10 multi-GPU tutorial, but I haven't found any implementation of model parallelism.

asked Feb 06 '17 by krish567


People also ask

Do I need mesh TensorFlow for data-parallel training?

If you just want data-parallel training (batch-splitting), then you do not need Mesh TensorFlow, though Mesh TensorFlow can do it. The most common reason for more sophisticated parallel computation is that the parameters of the model do not fit on one device, e.g. a 5-billion-parameter language model.
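For contrast, plain batch-splitting in TensorFlow (without Mesh TensorFlow) can be sketched as below; the single-layer model and MNIST-like shapes are hypothetical:

import tensorflow as tf

# Minimal sketch of data parallelism (batch-splitting), assuming
# hypothetical shapes: one shared copy of the parameters, and each
# GPU processes half of the batch.
x = tf.placeholder(tf.float32, [128, 784])
w = tf.Variable(tf.random_normal([784, 10]))   # shared weights
x0, x1 = tf.split(x, 2, axis=0)                # split the batch in two
with tf.device("/gpu:0"):
    y0 = tf.matmul(x0, w)
with tf.device("/gpu:1"):
    y1 = tf.matmul(x1, w)
y = tf.concat([y0, y1], axis=0)                # reassemble per-example outputs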

How does GPU tensor parallelism work?

In tensor parallelism, each GPU processes only a slice of a tensor and aggregates the full tensor only for operations that require the whole thing. This explanation uses concepts and diagrams from the Megatron-LM paper: Efficient Large-Scale Language Model Training on GPU Clusters.
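As a concrete illustration, a single matmul y = x @ W can be split column-wise across two GPUs; the shapes below are hypothetical:

import tensorflow as tf

# Sketch of column-wise tensor parallelism for y = x @ W: W's columns
# are split across the two GPUs, and the full output tensor is only
# aggregated at the end.
x = tf.placeholder(tf.float32, [128, 4])
with tf.device("/gpu:0"):
    w0 = tf.Variable(tf.random_normal([4, 3]))  # left half of W's columns
    y0 = tf.matmul(x, w0)
with tf.device("/gpu:1"):
    w1 = tf.Variable(tf.random_normal([4, 3]))  # right half of W's columns
    y1 = tf.matmul(x, w1)
y = tf.concat([y0, y1], axis=1)                 # aggregate the full tensor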

What are the applications of parallelism in machine learning?

In modern machine learning, the various approaches to parallelism are used to:

- fit very large models onto limited hardware - e.g. t5-11b is 45GB in just model params
- significantly speed up training - finish training that would take a year in hours

What is the significance of model-wise parallelism?

It is similar to tensor model parallelism or naive layer-wise model parallelism. The significance of this framework is that it takes resources like (1) GPU/TPU/CPU, (2) RAM/DRAM, and (3) fast intra-connect vs. slow inter-connect, and it automatically optimizes over them, algorithmically deciding which parallelization to use where.


1 Answer

Here's an example. The model has some parts on GPU 0, some parts on GPU 1, and some parts on the CPU, so this is three-way model parallelism.

with tf.device("/gpu:0"):
    a = tf.Variable(tf.ones(()))
    a = tf.square(a)
with tf.device("/gpu:1"):
    b = tf.Variable(tf.ones(()))
    b = tf.square(b)
with tf.device("/cpu:0"):
    loss = a+b
opt = tf.train.GradientDescentOptimizer(learning_rate=0.1)
train_op = opt.minimize(loss)

sess = tf.Session()
sess.run(tf.global_variables_initializer())
for i in range(10):
    loss0, _ = sess.run([loss, train_op])
    print("loss", loss0)
answered Oct 20 '22 by Yaroslav Bulatov