
How to maximize CPU utilization in tensorflow GPU training with Dataset API?

I am using the new Dataset API to train a simple feed-forward model, and I am interested in maximizing training speed. Since my network isn't huge, I see low GPU utilization, as expected. That is fine. What I don't understand is why CPU usage is also far from 100%. I am using a multi-CPU/GPU machine. Currently I get up to 140 steps/sec with batch_size = 128. If I cache the dataset, I can get up to 210 steps/sec (after the initial scan). So I expect that with sufficient prefetching I should be able to reach the same speed without caching. However, with various prefetch and prefetch_to_device parameters, I cannot get more than 140 steps/sec. Setting num_parallel_calls to the number of CPU cores improves throughput by about 20%.
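For context, the exact pipeline is not shown in the question; a minimal sketch of this kind of setup (using the modern TF 2.x API, with hypothetical in-memory arrays standing in for the real data source and a placeholder preprocessing function) might look like:

```python
import tensorflow as tf

# Hypothetical in-memory features/labels standing in for the real data source.
features = tf.random.uniform((1024, 32))
labels = tf.random.uniform((1024,), maxval=10, dtype=tf.int32)

def preprocess(x, y):
    # Placeholder per-example transformation; the real one is not shown.
    return tf.cast(x, tf.float32) * 2.0, y

dataset = (
    tf.data.Dataset.from_tensor_slices((features, labels))
    .map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)  # parallel CPU preprocessing
    .batch(128)
    .prefetch(tf.data.AUTOTUNE)  # overlap input production (CPU) with training steps (GPU)
)
```

Swapping `.prefetch(...)` for `.cache()` after the map stage is what produces the 210 steps/sec figure after the first epoch, since cached elements skip preprocessing entirely.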

Ideally I'd like the prefetching thread to run on a CPU core disjoint from the rest of the input pipeline, so that whatever benefit it provides is strictly additive. But from the CPU usage profile I suspect that prefetching and input processing occur on every core:

[Screenshot: CPU utilization profile showing activity spread across all cores]

Is there a way to have more control over CPU allocation? I have tried prefetch(1), prefetch(500), and several other values (placed right after batch or at the end of the dataset construction), as well as combinations with prefetch_to_device(gpu_device) and buffer sizes of None, 1, 500, etc. So far prefetch(500) without prefetch_to_device works best.

Why doesn't prefetch exhaust all the CPU power on my machine? What other bottlenecks could be limiting training speed?

Many thanks!

asked Jul 25 '18 by John Jiang




1 Answer

The Dataset.prefetch(buffer_size) transformation adds pipeline parallelism and (bounded) buffering to your input pipeline. Therefore, increasing buffer_size may increase the fraction of time during which the stages upstream of Dataset.prefetch() are running (because the buffer is more likely to have free space), but it does not increase the speed at which those stages run (and hence does not increase CPU usage).
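A small sketch of this point: prefetch only changes how far the producer may run ahead of the consumer, not what is produced or how fast each element is made.

```python
import tensorflow as tf

ds = tf.data.Dataset.range(8).map(lambda x: x * 2)

# prefetch(N) decouples the producer (the map stage) from the consumer with a
# bounded buffer of N elements, so the two run concurrently. A larger N only
# lets the producer run further ahead when the consumer is slow; it does not
# make any single element cheaper to produce, which is why buffer size alone
# cannot push CPU usage (or steps/sec) past the producer's own speed.
small_buffer = ds.prefetch(1)
large_buffer = ds.prefetch(100)

# The elements and their order are identical regardless of buffer size.
assert list(small_buffer.as_numpy_iterator()) == list(large_buffer.as_numpy_iterator())
```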

Typically, to increase the speed of the pipeline and increase CPU usage, you would add data parallelism by adding num_parallel_calls=N to any Dataset.map() transformations, and you might also consider using tf.contrib.data.parallel_interleave() to process many input sources concurrently and avoid blocking on I/O.
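Note that tf.contrib was removed in TF 2.x; the equivalent of tf.contrib.data.parallel_interleave() there is Dataset.interleave(..., num_parallel_calls=...). A minimal sketch of both suggestions, with a hypothetical "shards" dataset standing in for real input files:

```python
import tensorflow as tf

# Hypothetical shard indices; in practice this would be a dataset of filenames.
shards = tf.data.Dataset.range(4)

# Process several input sources concurrently so a slow read doesn't block the
# whole pipeline (TF 2.x replacement for tf.contrib.data.parallel_interleave).
ds = shards.interleave(
    lambda i: tf.data.Dataset.range(i * 10, i * 10 + 3),  # "read" one shard
    cycle_length=4,                       # draw from 4 sources at a time
    num_parallel_calls=tf.data.AUTOTUNE,  # open/read sources in parallel
)

# Data parallelism: transform multiple elements at once across CPU cores.
ds = ds.map(lambda x: x + 1, num_parallel_calls=tf.data.AUTOTUNE)
```

Unlike prefetch, num_parallel_calls actually multiplies the rate at which elements are produced, so this is what drives CPU usage up.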

The tf.data Performance Guide has more details about how to improve the performance of input pipelines, including these suggestions.

answered Sep 28 '22 by mrry