
TensorFlow Dataset API not using GPU

1. Problem:

I have a tf.data.Dataset that I give to a Keras model (tf.python.keras) with train_on_batch.

My dataset pipeline looks like this:

Generate TFRecord path > tf.data.TFRecordDataset > Parse single example > Batch(2) > Map(merge) > Map(normalize) > Map(split to inputs,labels) > Batch(batch_size) > Prefetch(1)

I used RunMetadata to output a timeline readable in Chrome. It looks like IteratorGetNext only runs on the CPU and is eating a significant amount of time.

(I can't post images; IteratorGetNext took 617 ms, MemcpyHtoD took 58 ms, and training took 500 ms.)

I can't find a way to get IteratorGetNext to run on the GPU, even partially. Currently, the CPU sits at 100% and the GPU at 40-60% at most.

I would expect something like:

Read from disk > Move from CPU to GPU > Preprocess.

I am currently using only one GPU, but I plan to use more GPUs later, so a scalable solution would be perfect!

By the way, I am using tensorflow-gpu 1.13.1 on Windows 10 with CUDA 10.0 and Python 3.6.7. I am not using eager mode. I haven't tried on Ubuntu, but it is a possibility.

2. What I tried:

I tried using prefetch_to_device and copy_to_device from tf.data.experimental, in several places in the pipeline.

When using copy_to_device, IteratorGetNext took twice as long. It looked like data was copied to the GPU only to be copied back to the CPU, because MemcpyHtoD was still present after IteratorGetNext.
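For reference, a minimal self-contained sketch of how prefetch_to_device is meant to be applied: it has to be the last transformation in the pipeline, otherwise the data is pulled back to the host. The toy dataset and the "/cpu:0" device string are placeholders so the sketch runs without a GPU; in my case the target would be "/gpu:0".

```python
import tensorflow as tf

# Toy pipeline standing in for the TFRecord pipeline above.
dataset = tf.data.Dataset.range(8).batch(2)

# prefetch_to_device must be the last transformation in the pipeline;
# any map/batch applied after it forces the data back to the host.
# "/cpu:0" stands in for "/gpu:0" so this runs without a GPU.
dataset = dataset.apply(tf.data.experimental.prefetch_to_device("/cpu:0"))
```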

I tried replacing Keras' train_on_batch with session.run(train_op), but it did not really help. The only change I noticed was that some prefetching actually happened, reducing the IteratorGetNext time for a few samples (independent of the value I passed to prefetch).

By the way, prefetch(1) and prefetch(tf.data.experimental.AUTOTUNE) did not seem to make any difference.

I tried session.run both with and without copy_to_device.

I also tried building the dataset inside a with tf.device("/gpu:0") block.

3. Some code:

dataset = tf.data.Dataset.from_generator(self.random_shard_filepath_generator,
                                         output_types=tf.string,
                                         output_shapes=())

dataset = tf.data.TFRecordDataset(dataset)
dataset = dataset.map(lambda serialized_shard: self.parse_shard(serialized_shard, output_labels))

dataset = dataset.batch(self.shards_per_sample)
dataset = dataset.map(self.join_shards_randomly)
dataset = dataset.map(self.normalize_batch)
dataset = dataset.map(self.split_batch_io)

dataset = dataset.batch(batch_size).prefetch(1)

autoencoder.train_on_batch(dataset)

Finally, I would add that my model may simply not be big enough; I could improve the ratio by making it "bigger", but that does not feel like a great solution.

-- Edit:

I had:

...
dataset = dataset.batch(batch_size).prefetch(1)
autoencoder.train_on_batch(dataset)

Which I changed to:

...
dataset = dataset.batch(batch_size).prefetch(1)
dataset_iterator = dataset.make_initializable_iterator()
dataset_initializer = dataset_iterator.initializer

session.run(dataset_initializer)

x, y = dataset_iterator.get_next()
autoencoder.train_on_batch(x, y)

Thanks to EdoardoG for making me try MultiDeviceIterator, which led me to create an iterator outside of Keras' train_on_batch.

Now IteratorGetNext takes only about 0.05 ms, where it previously took about 600 ms.

asked Nov 16 '22 by Zelgunn
1 Answer

As far as I know, Dataset API operations usually run on the CPU, so it is actually normal that you cannot run your input pipeline on the GPU.

Someone has written an iterator which could solve your problem.

answered Jun 12 '23 by EdoardoG