I am using the new Dataset API to train a simple feed-forward DL model, and I am interested in maximizing training speed. Since my network isn't huge, I see low GPU utilization, as expected. That is fine. But what I don't understand is why CPU usage is also far from 100%. I am using a machine with multiple CPU cores and a GPU. Currently I get up to 140 steps/sec with batch_size = 128. If I cache the dataset I can get up to 210 steps/sec (after the initial scan). So I expect that with sufficient prefetching, I should be able to reach the same speed without caching. However, with various prefetch and prefetch_to_device parameters, I cannot get more than 140 steps/sec. I also set num_parallel_calls to the number of CPU cores, which improves throughput by about 20%.
Ideally I'd like the prefetching thread to run on a CPU core disjoint from the rest of the input pipeline, so that whatever benefit it provides is strictly additive. But from CPU usage profiling I suspect that prefetching and input processing occur on every core.
Is there a way to have more control over CPU allocation? I have tried prefetch(1), prefetch(500), and several other values (placed right after batch() or at the end of the dataset construction), as well as in combination with prefetch_to_device(gpu_device, buffer_size=None, 1, 500, etc.). So far prefetch(500) without prefetch_to_device works the best.
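Roughly, the variants I have been trying look like the sketch below (filenames, parse_fn, and NUM_CPU_CORES are placeholders, not my exact code):

```python
import tensorflow as tf

# Placeholders for my actual input files and parsing logic.
filenames = ["train-00000.tfrecord"]  # hypothetical file list
NUM_CPU_CORES = 8                     # set to the number of CPU cores

def parse_fn(example_proto):
    # Hypothetical parsing of a serialized example into (features, label).
    parsed = tf.parse_single_example(
        example_proto,
        {"x": tf.FixedLenFeature([32], tf.float32),
         "y": tf.FixedLenFeature([], tf.int64)})
    return parsed["x"], parsed["y"]

dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.map(parse_fn, num_parallel_calls=NUM_CPU_CORES)
dataset = dataset.batch(128)

# Variant A: host-side prefetching only (prefetch(500) works best for me so far).
dataset = dataset.prefetch(500)

# Variant B: additionally stage batches onto the GPU (applied as the last step).
# dataset = dataset.apply(tf.contrib.data.prefetch_to_device("/gpu:0"))
```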
Why doesn't prefetch try to use all the CPU power on my machine? What are other possible bottlenecks in training speed?
Many thanks!
It has been observed that TensorFlow's performance depends significantly on the CPU when training on a small dataset, and that the GPU becomes more important when training on a large dataset.
The Dataset.prefetch(buffer_size) transformation adds pipeline parallelism and (bounded) buffering to your input pipeline. Therefore, increasing the buffer_size may increase the fraction of time when the input to the Dataset.prefetch() is running (because the buffer is more likely to have free space), but it does not increase the speed at which the input runs (and hence does not increase CPU usage).
Typically, to increase the speed of the pipeline and increase CPU usage, you would add data parallelism by adding num_parallel_calls=N to any Dataset.map() transformations, and you might also consider using tf.contrib.data.parallel_interleave() to process many input sources concurrently and avoid blocking on I/O.
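For example, a pipeline combining these might look roughly like the following sketch; the file pattern, parse_fn, and the cycle_length/num_parallel_calls values are placeholders you would tune for your own data:

```python
import tensorflow as tf

# Hypothetical file pattern; substitute your own training files.
files = tf.data.Dataset.list_files("/path/to/train-*.tfrecord")

def parse_fn(example_proto):
    # Hypothetical feature spec; replace with your own parsing logic.
    return tf.parse_single_example(
        example_proto, {"x": tf.FixedLenFeature([32], tf.float32)})

# Read several input files concurrently instead of one at a time,
# so the pipeline does not block on the I/O of a single file.
dataset = files.apply(tf.contrib.data.parallel_interleave(
    lambda filename: tf.data.TFRecordDataset(filename),
    cycle_length=8))

# Parse records on multiple CPU cores in parallel.
dataset = dataset.map(parse_fn, num_parallel_calls=8)

dataset = dataset.batch(128)

# A small prefetch buffer is usually enough once the stages above run in parallel.
dataset = dataset.prefetch(1)
```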
The tf.data Performance Guide has more details about how to improve the performance of input pipelines, including these suggestions.