1. Problem:
I have a tf.data.Dataset that I give to a Keras model (tf.python.keras) with train_on_batch.
My dataset looks like this:
Generate TFRecord path > tf.data.TFRecordDataset > Parse single example > Batch(2) > Map(merge) > Map(normalize) > Map(split to inputs,labels) > Batch(batch_size) > Prefetch(1)
I used RunMetadata to output a Timeline readable with Chrome.
It looks like IteratorGetNext is only run on the CPU and is eating a significant amount of time. (I can't post images; IteratorGetNext took 617ms, MEMCPYHtoD took 58ms and training took 500ms.)
I can't seem to find a way to get IteratorGetNext to run on the GPU, even partially. Currently, CPU is used at 100% and GPU at 40-60% at most.
I would expect something like:
Read from disk > Move from CPU to GPU > Preprocess.
I am currently using only one GPU, but I plan to use more GPUs later, so a scalable solution would be perfect!
By the way, I am using tensorflow-gpu 1.13.1 on Windows 10 with CUDA 10.0 and python 3.6.7. I am not using eager mode. I haven't tried on Ubuntu but it is a possibility.
2. What I tried:
I tried using prefetch_to_device and copy_to_device from tf.data.experimental, in several places in the pipeline.
When using copy_to_device, IteratorGetNext took twice as long. It looked like the data was being copied to the GPU only to be copied back to the CPU, because the MEMCPYHtoD was still present after IteratorGetNext.
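For reference, a typical placement I tried looked roughly like this (the device string and buffer sizes are illustrative; prefetch_to_device has to be the last transformation in the pipeline):

dataset = dataset.batch(batch_size)
# Variant A: copy batches to the GPU explicitly, then prefetch there
dataset = dataset.apply(tf.data.experimental.copy_to_device("/gpu:0")).prefetch(1)
# Variant B: prefetch_to_device as the final transformation
# dataset = dataset.apply(tf.data.experimental.prefetch_to_device("/gpu:0", buffer_size=1))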
I tried replacing Keras' train_on_batch with session.run(train_op), but it did not really improve things; the only change I noticed was that some prefetching actually happened, reducing IteratorGetNext time for a few samples (independent of the amount I put in prefetch).
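Roughly, the session.run variant looked like this (the loss and optimizer are simplified placeholders for what I actually use):

iterator = dataset.make_initializable_iterator()
x, y = iterator.get_next()

reconstruction = autoencoder(x)  # call the Keras model on the symbolic tensors
loss = tf.losses.mean_squared_error(y, reconstruction)
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)

session.run([tf.global_variables_initializer(), iterator.initializer])
for _ in range(num_steps):  # num_steps chosen elsewhere
    session.run(train_op)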
By the way, prefetch(1) or prefetch(tf.data.experimental.AUTOTUNE) did not seem to have any impact.
I tried session.run both with and without copy_to_device.
I also tried to put the building of the dataset inside a with tf.device("/gpu:0"): block.
3. Some code:
dataset = tf.data.Dataset.from_generator(self.random_shard_filepath_generator,
                                         output_types=tf.string,
                                         output_shapes=())
dataset = tf.data.TFRecordDataset(dataset)
dataset = dataset.map(lambda serialized_shard: self.parse_shard(serialized_shard, output_labels))
dataset = dataset.batch(self.shards_per_sample)
dataset = dataset.map(self.join_shards_randomly)
dataset = dataset.map(self.normalize_batch)
dataset = dataset.map(self.split_batch_io)
dataset = dataset.batch(batch_size).prefetch(1)
autoencoder.train_on_batch(dataset)
Finally, I would add that my model may just not be big enough, and I could improve the compute-to-input ratio by simply making it "bigger", but that does not feel like a great solution.
-- Edit:
I had:
...
dataset = dataset.batch(batch_size).prefetch(1)
autoencoder.train_on_batch(dataset)
Which I changed to:
...
dataset = dataset.batch(batch_size).prefetch(1)
dataset_iterator = dataset.make_initializable_iterator()
dataset_initializer = dataset_iterator.initializer
session.run(dataset_initializer)
x, y = dataset_iterator.get_next()
autoencoder.train_on_batch(x, y)
Thanks to EdoardoG for making me try MultiDeviceIterator, which made me create an Iterator outside of Keras' train_on_batch.
Now IteratorGetNext only takes about 0.05ms, where it previously took about 600ms.
As far as I know, Dataset API operations are usually run on the CPU, so it's actually normal that you cannot run your input pipeline on the GPU.
Someone has written an iterator which could solve your problem.
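For example, with tf.data.experimental.MultiDeviceIterator (available in 1.13), a minimal single-GPU sketch could look like this (the device string and buffer size are illustrative):

# 'dataset' is your pipeline, up to and including .batch(batch_size)
mdi = tf.data.experimental.MultiDeviceIterator(dataset,
                                               devices=["/gpu:0"],
                                               prefetch_buffer_size=2)
x, y = mdi.get_next()[0]  # one (inputs, labels) structure per device

session.run(mdi.initializer)
autoencoder.train_on_batch(x, y)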