I would like to use multiple GPUs to train my TensorFlow model, taking advantage of data parallelism.
I am currently training the model with the following approach:
x_ = tf.placeholder(...)
y_ = tf.placeholder(...)
y = model(x_)
loss = tf.losses.sparse_softmax_cross_entropy(labels=y_, logits=y)
optimizer = tf.train.AdamOptimizer()
train_op = tf.contrib.training.create_train_op(loss, optimizer)
for i in epochs:
    for b in data:
        _ = sess.run(train_op, feed_dict={x_: b.x, y_: b.y})
I would like to take advantage of multiple GPUs to train this model in a data-parallel manner, i.e. split each batch in half and run each half on one of my two GPUs.
cifar10_multi_gpu_train.py seems to provide a good example of building a loss that draws from towers running on multiple GPUs, but I haven't found a good example of doing this style of training when using feed_dict and placeholder as opposed to a data loader queue.
UPDATE
It seems like https://timsainb.github.io/multi-gpu-vae-gan-in-tensorflow.html might provide a good example. They pull in average_gradients from cifar10_multi_gpu_train.py and create one placeholder, which they then slice into once per GPU.
I think you also need to split create_train_op into three stages: compute_gradients, average_gradients, and then apply_gradients.
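Here is a minimal sketch of that three-stage approach, assuming a model() function, an average_gradients() helper copied from cifar10_multi_gpu_train.py, two GPUs, and a batch size divisible by the GPU count; the shapes and the batch_x / batch_y names are illustrative, not from the original post.

num_gpus = 2
x_ = tf.placeholder(tf.float32, [None, 784], name="x")   # example shape, adjust to your data
y_ = tf.placeholder(tf.int64, [None], name="y")
optimizer = tf.train.AdamOptimizer()

# Split the one big placeholder along the batch dimension, one slice per GPU.
x_splits = tf.split(x_, num_gpus, axis=0)
y_splits = tf.split(y_, num_gpus, axis=0)

tower_grads = []
with tf.variable_scope(tf.get_variable_scope()):
    for i in range(num_gpus):
        with tf.device("/gpu:%d" % i), tf.name_scope("tower_%d" % i):
            logits = model(x_splits[i])
            loss = tf.losses.sparse_softmax_cross_entropy(labels=y_splits[i], logits=logits)
            # Stage 1: per-tower gradients.
            tower_grads.append(optimizer.compute_gradients(loss))
            tf.get_variable_scope().reuse_variables()  # share weights between towers

# Stage 2: average the per-tower gradients (helper from cifar10_multi_gpu_train.py).
grads = average_gradients(tower_grads)
# Stage 3: apply the averaged gradients once.
train_op = optimizer.apply_gradients(grads)

with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
    sess.run(tf.global_variables_initializer())
    _ = sess.run(train_op, feed_dict={x_: batch_x, y_: batch_y})  # feed the full batch; the graph splits it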
If you have more than one GPU, the GPU with the lowest ID is selected by default, but TensorFlow does not spread operations across multiple GPUs automatically. To use more than one GPU, you have to override the default placement and specify explicitly which device each part of the graph should run on.
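For example (a toy sketch; the ops themselves are arbitrary):

import tensorflow as tf

a = tf.random_normal([1024, 1024])
with tf.device("/gpu:0"):
    b = tf.matmul(a, a)   # placed on GPU 0
with tf.device("/gpu:1"):
    c = tf.matmul(a, a)   # placed on GPU 1

config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)
with tf.Session(config=config) as sess:
    sess.run([b, c])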
tf.distribute.Strategy is a TensorFlow API for distributing training across multiple GPUs, multiple machines, or TPUs. Using this API, you can distribute your existing models and training code with minimal code changes.
Its advantages: it scales to large models with millions or billions of parameters (GPT-2, GPT-3, BERT, et cetera), it can keep latency across workers low, and it has good TensorFlow community support.
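A minimal TF 2.x sketch with tf.distribute.MirroredStrategy (the Keras model, layer sizes, and the features/labels arrays are illustrative assumptions):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()           # one replica per visible GPU
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():                                # variables created here are mirrored on every GPU
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(optimizer="adam",
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

# The global batch is split across the replicas automatically.
dataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(64)
model.fit(dataset, epochs=5)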
I know three ways of feeding data to a multi-GPU model.
1. Create the placeholder x on the CPU, then use tf.split to split x into xs along the batch dimension. On each GPU tower, take xs[i] as that tower's input.

with tf.device("/cpu:0"):
    encoder_inputs = tf.placeholder(tf.int32, [None, None], name="encoder_inputs")
    encoder_length = tf.placeholder(tf.int32, [None,], name="encoder_length")

    # make sure batch % num_gpus == 0
    inputs = tf.split(encoder_inputs, num_gpus, axis=0)  # axis=0: split on the batch dimension
    lens = tf.split(encoder_length, num_gpus, axis=0)

with tf.variable_scope(tf.get_variable_scope()):
    for i in range(num_gpus):
        with tf.device("/gpu:%d" % i):
            with tf.name_scope("tower_%d" % i):
                loss = compute_loss(inputs[i], lens[i])  # per-tower loss on slice i
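Feeding then works exactly as in the single-GPU case: you feed the full batch into the CPU placeholders and the split happens inside the graph. A sketch, assuming a train_op built from the per-tower losses and batch arrays batch_inputs / batch_lengths:

with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
    sess.run(tf.global_variables_initializer())
    _ = sess.run(train_op, feed_dict={encoder_inputs: batch_inputs,    # full batch
                                      encoder_length: batch_lengths})  # sliced in-graph per tower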
2. Create the placeholder x on every GPU, inside a scope.

def init_placeholder(self):
    with tf.variable_scope("inputs"):  # use a scope
        encoder_inputs = tf.placeholder(tf.int32, [None, None], name="encoder_inputs")
        encoder_length = tf.placeholder(tf.int32, [None,], name="encoder_length")
    return encoder_inputs, encoder_length

with tf.variable_scope(tf.get_variable_scope()):
    for g, gpu in enumerate(GPUS):
        with tf.device("/gpu:%d" % gpu):
            with tf.name_scope("tower_%d" % g):
                x, x_len = model.init_placeholder()  # these placeholder Tensors live on the GPU
                loss = model.compute_loss(x, x_len)
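With this layout each tower has its own placeholders, so the feed_dict has to map every tower's placeholders to its own slice of the host batch. A sketch, assuming the (x, x_len) pairs from the loop above were collected into a hypothetical list tower_inputs and that train_op and the batch arrays already exist:

feed = {}
for g, (x, x_len) in enumerate(tower_inputs):        # one (x, x_len) pair per GPU tower
    feed[x] = batch_inputs[g::len(tower_inputs)]     # e.g. a strided split of the host batch
    feed[x_len] = batch_lengths[g::len(tower_inputs)]
_ = sess.run(train_op, feed_dict=feed)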
3. Use tf.data.Dataset to feed the data. The official cifar10_multi_gpu_train.py uses a Queue, which is similar to this approach.