We are running multi-GPU jobs in TensorFlow and evaluating a migration from the queue-based model (using the string_input_producer interface) to the new TensorFlow Dataset API. The latter appears to offer an easier way to switch between training and validation concurrently.
A snippet of code below shows how we are doing this.
train_dataset, train_iterator = get_dataset(train_files, batch_size, epochs)
val_dataset, val_iterator = get_dataset(val_files, batch_size, epochs)

is_validating = tf.placeholder(dtype=bool, shape=())
next_batch = tf.cond(is_validating,
                     lambda: val_iterator.get_next(),
                     lambda: train_iterator.get_next())
validation_tower = self.num_gpus - 1
tower_grads = []

for i in range(self.num_gpus):
    with tf.variable_scope(tf.get_variable_scope(), reuse=(i > 0)):
        with tf.device('/gpu:%d' % i), tf.name_scope('%s_%d' % ('gpu_', i)) as scope:
            if i == validation_tower:
                images, labels = next_batch
                # Loss funcs snipped out
            else:
                images, labels = next_batch
                # Loss funcs snipped out
The get_dataset function builds a dataset, applies a map function, and sets a batch size. It also builds an iterator but does not initialize it; the iterator is initialized before the session starts.
The is_validating boolean is supplied while the session is running: every few steps we pass is_validating=True via a feed_dict so that the validation dataset is used.
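For concreteness, here is a rough sketch of what such a get_dataset helper might look like; the TFRecord format and the _parse_example map function are assumptions for illustration, not the actual code:

import tensorflow as tf

def _parse_example(serialized):
    # Placeholder map function; the real one is not shown in the question.
    features = tf.parse_single_example(serialized, {
        'image': tf.FixedLenFeature([], tf.string),
        'label': tf.FixedLenFeature([], tf.int64),
    })
    image = tf.decode_raw(features['image'], tf.uint8)
    return image, features['label']

def get_dataset(files, batch_size, epochs):
    dataset = tf.data.TFRecordDataset(files)
    dataset = dataset.map(_parse_example)
    dataset = dataset.batch(batch_size)
    dataset = dataset.repeat(epochs)
    # The iterator is built here but initialized later, before the session runs.
    iterator = dataset.make_initializable_iterator()
    return dataset, iterator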
The question I have is:
Let's say I have 8 GPUs, so we run training on 7 of them. Does the iterator advance from the same point for each of these 7 GPUs, hence supplying all 7 GPUs with the same data?
tf.distribute.Strategy is a TensorFlow API to distribute training across multiple GPUs, multiple machines, or TPUs. Using this API, you can distribute your existing models and training code with minimal code changes.
If a TensorFlow operation has both CPU and GPU implementations, TensorFlow gives priority to the GPU device when placing the operation. If you have more than one GPU, the GPU with the lowest ID is selected by default. However, TensorFlow does not place operations onto multiple GPUs automatically.
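For example, to override the default placement and pin an op to a specific GPU (device names here assume at least two GPUs are visible):

import tensorflow as tf

# Without an explicit device scope this matmul would land on /gpu:0 by default.
with tf.device('/gpu:1'):
    a = tf.random_normal([1024, 1024])
    b = tf.random_normal([1024, 1024])
    c = tf.matmul(a, b)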
To use data parallelism with PyTorch, you can use the DataParallel class. With this class, you define your GPU IDs and wrap your network (an nn.Module) in a DataParallel object. Then, when you call the wrapped model, it splits each input batch across the defined GPUs.
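A rough PyTorch sketch of that pattern (the model, device IDs, and batch shape are placeholders):

import torch
import torch.nn as nn

model = nn.Linear(128, 10)                        # any nn.Module works here
model = nn.DataParallel(model, device_ids=[0, 1]).cuda()

inputs = torch.randn(64, 128).cuda()              # one combined batch
outputs = model(inputs)                           # scattered across GPUs 0 and 1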
MirroredStrategy is a strategy you can use to perform synchronous distributed training across multiple GPUs. It creates one replica of your model per GPU, with the model variables mirrored across the replicas.
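For completeness, a minimal Keras-style sketch, assuming a TensorFlow version that ships tf.distribute.MirroredStrategy (TF 2.x; in early releases the same class lived under tf.contrib.distribute):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()       # one replica per visible GPU
with strategy.scope():
    # Variables created in this scope are mirrored across the replicas.
    model = tf.keras.Sequential(
        [tf.keras.layers.Dense(10, input_shape=(128,))])
    model.compile(optimizer='sgd', loss='mse')
# model.fit(...) then runs one synchronous step per batch across all GPUs.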
At present there are three main options, which have different usability and performance trade-offs:

1. In the Dataset.batch() transform, create a single large batch containing examples for all of your GPUs. Then use tf.split(..., self.num_gpus) on the output of Iterator.get_next() to create sub-batches for each GPU. This is probably the easiest approach, but it does place the splitting on the critical path. (A sketch follows the list.)

2. In the Dataset.batch() transform, create a mini-batch that is sized for a single GPU. Then call Iterator.get_next() once per GPU to get multiple different batches. (By contrast, in your current code the same value of next_batch is sent to each GPU, which is probably not what you wanted to happen.) (A sketch follows the list.)

3. Create multiple iterators, one per GPU. Shard the data using Dataset.shard() early in the pipeline (e.g. on the list of files if your dataset is sharded). Note that this approach will consume more resources on the host, so you may need to dial down any buffer sizes and/or degrees of parallelism. (A sketch follows the list.)
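A minimal sketch of option 1, with illustrative shapes and a dummy in-memory dataset standing in for the real pipeline:

import tensorflow as tf

num_gpus = 2
per_gpu_batch = 32

# Dummy (image, label) dataset standing in for the real input pipeline.
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.zeros([128, 28, 28, 1]), tf.zeros([128], dtype=tf.int64)))

# One large batch covering all GPUs, split after get_next().
dataset = dataset.batch(per_gpu_batch * num_gpus)
iterator = dataset.make_one_shot_iterator()
images, labels = iterator.get_next()

image_shards = tf.split(images, num_gpus)
label_shards = tf.split(labels, num_gpus)

for i in range(num_gpus):
    with tf.device('/gpu:%d' % i):
        # Build tower i from its own sub-batch.
        tower_loss = tf.reduce_mean(image_shards[i])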
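Option 2 is the same pipeline with a per-GPU batch size, calling get_next() once per tower so that each tower receives a different batch:

import tensorflow as tf

num_gpus = 2
per_gpu_batch = 32

dataset = tf.data.Dataset.from_tensor_slices(
    (tf.zeros([128, 28, 28, 1]), tf.zeros([128], dtype=tf.int64)))
dataset = dataset.batch(per_gpu_batch)            # sized for a single GPU
iterator = dataset.make_one_shot_iterator()

for i in range(num_gpus):
    # Each call creates a separate get_next op, so each tower consumes
    # a different batch from the shared iterator.
    images, labels = iterator.get_next()
    with tf.device('/gpu:%d' % i):
        tower_loss = tf.reduce_mean(images)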
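Option 3, sharding early in the pipeline over a list of input files, might look like this (the TFRecord file names are placeholders):

import tensorflow as tf

num_gpus = 2
per_gpu_batch = 32
file_names = ['train-00000-of-00002.tfrecord',    # placeholder shard names
              'train-00001-of-00002.tfrecord']

per_gpu_batches = []
for i in range(num_gpus):
    # One independent pipeline per GPU, fed from a disjoint shard of files.
    files = tf.data.Dataset.from_tensor_slices(file_names).shard(num_gpus, i)
    dataset = files.flat_map(tf.data.TFRecordDataset)
    dataset = dataset.batch(per_gpu_batch)
    iterator = dataset.make_one_shot_iterator()
    per_gpu_batches.append(iterator.get_next())
# per_gpu_batches[i] then feeds the tower built on '/gpu:%d' % i.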
Note that the current tf.data pipelines run on the CPU only, and an important aspect of an efficient pipeline is staging your training input to the GPU while the previous step is still running. See the TensorFlow CNN benchmarks for example code that shows how to stage data to GPUs efficiently. We are currently working on adding this support to the tf.data API directly.
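As a hedged aside, later 1.x releases added a prefetch-to-GPU transformation for this (tf.contrib.data.prefetch_to_device, later tf.data.experimental.prefetch_to_device); whether it is available depends on your TensorFlow version. A sketch with a dummy dataset:

import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices(tf.zeros([128, 28, 28, 1]))
dataset = dataset.batch(32)
# Keep the next batch resident on the GPU while the current step runs.
dataset = dataset.apply(tf.contrib.data.prefetch_to_device('/gpu:0'))

iterator = dataset.make_one_shot_iterator()
images = iterator.get_next()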