 

Multi GPU Training in Tensorflow (Data Parallelism) when Using feed_dict

Tags:

tensorflow

I would like to use multiple GPUs to train my Tensorflow model taking advantage of data parallelism.

I am currently training a Tensorflow model using the following approach:

import tensorflow as tf

x_ = tf.placeholder(...)
y_ = tf.placeholder(...)
y = model(x_)
loss = tf.losses.sparse_softmax_cross_entropy(labels=y_, logits=y)
optimizer = tf.train.AdamOptimizer()
train_op = tf.contrib.training.create_train_op(loss, optimizer)

for i in epochs:
    for b in data:
        _ = sess.run(train_op, feed_dict={x_: b.x, y_: b.y})

I would like to take advantage of multiple GPUs to train this model in a data-parallel manner, i.e. I would like to split each batch in half and run each half on one of my two GPUs.

cifar10_multi_gpu_train.py seems to provide a good example of creating a loss that draws from graphs running on multiple GPUs, but I haven't found a good example of doing this style of training when using feed_dict and placeholders as opposed to a data-loading queue.

UPDATE

It seems like https://timsainb.github.io/multi-gpu-vae-gan-in-tensorflow.html might provide a good example. They pull in average_gradients from cifar10_multi_gpu_train.py and create one placeholder which they then slice into, one slice per GPU. I think you also need to split create_train_op into three stages: compute_gradients, average_gradients, and then apply_gradients. A sketch of that idea is below.
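The following is only a minimal sketch of that pattern, not a tested implementation: it assumes two GPUs, illustrative placeholder shapes, the model() function from the snippet above, and the average_gradients helper from cifar10_multi_gpu_train.py.

import tensorflow as tf

num_gpus = 2

with tf.device("/cpu:0"):
    x_ = tf.placeholder(tf.float32, [None, 784])   # shapes here are just for illustration
    y_ = tf.placeholder(tf.int64, [None])
    optimizer = tf.train.AdamOptimizer()

    # Slice the single batch placeholder into one shard per GPU.
    x_splits = tf.split(x_, num_gpus, axis=0)
    y_splits = tf.split(y_, num_gpus, axis=0)

tower_grads = []
with tf.variable_scope(tf.get_variable_scope()):
    for i in range(num_gpus):
        with tf.device("/gpu:%d" % i):
            with tf.name_scope("tower_%d" % i):
                logits = model(x_splits[i])
                loss = tf.losses.sparse_softmax_cross_entropy(
                    labels=y_splits[i], logits=logits)
                # Reuse the model variables on every tower after the first.
                tf.get_variable_scope().reuse_variables()
                # Stage 1: compute_gradients on this tower only.
                tower_grads.append(optimizer.compute_gradients(loss))

# Stage 2 and 3: average_gradients (from cifar10_multi_gpu_train.py), then apply once.
grads = average_gradients(tower_grads)
train_op = optimizer.apply_gradients(grads)

# The feed_dict loop stays unchanged: feed the full batch and let tf.split shard it.
# _ = sess.run(train_op, feed_dict={x_: b.x, y_: b.y})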

Asked by Alex Rothberg on Apr 05 '17.

People also ask

How do I use multiple GPUs with TensorFlow?

If you have more than one GPU, the GPU with the lowest ID will be selected by default. However, TensorFlow does not place operations into multiple GPUs automatically. To override the device placement to use multiple GPUs, we manually specify the device that a computation node should run on.
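For example, a minimal illustration of manual device placement (not tied to this question's model):

import tensorflow as tf

# Explicitly pin each group of ops to a specific GPU.
with tf.device("/gpu:0"):
    a = tf.random_normal([1000, 1000])
    b = tf.matmul(a, a)

with tf.device("/gpu:1"):
    c = tf.random_normal([1000, 1000])
    d = tf.matmul(c, c)

# allow_soft_placement falls back to another device if a GPU is missing.
with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
    print(sess.run([b, d]))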

What is the TensorFlow API to distribute training across multiple GPUs, multiple machines, or TPUs?

tf.distribute.Strategy is a TensorFlow API to distribute training across multiple GPUs, multiple machines, or TPUs. Using this API, you can distribute your existing models and training code with minimal code changes.
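As a minimal sketch of that later, higher-level API (this replaces the hand-written tower loops discussed in this question; the model definition here is only a placeholder):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # one replica per visible GPU

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
    model.compile(optimizer="adam",
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

# model.fit(features, labels, batch_size=256)  # each batch is split across the replicas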

What is the advantage of using distributed training in TensorFlow?

Advantages: it can train large models with millions or billions of parameters (GPT-3, GPT-2, BERT, and so on), it offers potentially low latency across the workers, and it has good TensorFlow community support.


1 Answer

I know three ways of feeding data to a multi-GPU model.

  1. If all your inputs are of the same shape, you can build placeholder x on the CPU, then use tf.split to split x into xs. Then on each GPU tower, use xs[i] as your input.
with tf.device("/cpu:0"):
    encoder_inputs = tf.placeholder(tf.int32, [None, None], name="encoder_inputs")
    encoder_length = tf.placeholder(tf.int32, [None,], name="encoder_length")

    # make sure batch % num_gpu == 0
    inputs = tf.split(encoder_inputs, axis=0)  # axis=0, split on batch dimension
    lens = tf.split(encoder_length, axis=0)

with tf.variable_scope(tf.get_variable_scope()):
    for i in range(num_gpus):
        with tf.device("/gpu:%d"%i):
            with tf.name_scope("tower_%d"%i):
                loss = compute_loss(inputs[i], lens[i])
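At run time you still feed one full batch to the CPU placeholders and the tf.split ops hand each tower its shard. A sketch of the run call, assuming the per-tower losses/gradients have been combined into a single train_op (e.g. via average_gradients as in the question's update) and that batch_inputs / batch_lengths come from your own data pipeline:

# batch size must be divisible by num_gpus
_ = sess.run(train_op, feed_dict={encoder_inputs: batch_inputs,
                                  encoder_length: batch_lengths})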

  2. If your inputs have different shapes, you need to build a placeholder x on every GPU, each under its own scope.

def init_placeholder(self):
    with tf.variable_scope("inputs"):   # use a scope
        encoder_inputs = tf.placeholder(tf.int32, [None, None], name="encoder_inputs")
        encoder_length = tf.placeholder(tf.int32, [None], name="encoder_length")
    return encoder_inputs, encoder_length

with tf.variable_scope(tf.get_variable_scope()):
    for g, gpu in enumerate(GPUS):
        with tf.device("/gpu:%d" % gpu):
            with tf.name_scope("tower_%d" % g):
                x, x_len = model.init_placeholder()  # these placeholder tensors live on the GPU
                loss = model.compute_loss(x, x_len)
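With per-tower placeholders you split the batch yourself on the host and feed every tower its own shard. A sketch, assuming the (x, x_len) pairs returned in the loop above were collected into a hypothetical placeholders list and that batch_inputs / batch_lengths come from your own data pipeline:

import numpy as np

feed = {}
x_shards = np.array_split(batch_inputs, len(GPUS))
len_shards = np.array_split(batch_lengths, len(GPUS))
for (x, x_len), x_shard, len_shard in zip(placeholders, x_shards, len_shards):
    feed[x] = x_shard        # each tower's placeholder gets its own slice of the batch
    feed[x_len] = len_shard
_ = sess.run(train_op, feed_dict=feed)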
  3. Use tf.data.Dataset to feed data. Google's official cifar10_multi_gpu_train.py uses a queue, which is similar to this approach; a sketch follows.
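A minimal sketch of that third option in the same TF1 style as above, assuming features/labels arrays, a num_gpus constant, and the same compute_loss helper; each call to get_next() gives a tower its own batch:

dataset = tf.data.Dataset.from_tensor_slices((features, labels))
dataset = dataset.shuffle(10000).batch(batch_size).prefetch(1)
iterator = dataset.make_one_shot_iterator()

with tf.variable_scope(tf.get_variable_scope()):
    for i in range(num_gpus):
        with tf.device("/gpu:%d" % i):
            with tf.name_scope("tower_%d" % i):
                x_batch, y_batch = iterator.get_next()  # separate batch per tower
                loss = compute_loss(x_batch, y_batch)
                tf.get_variable_scope().reuse_variables()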
Answered by huosan0123 on Jan 04 '23.