 

Why should preprocessing be done on CPU rather than GPU?

Tags:

tensorflow

The performance guide advises doing the preprocessing on the CPU rather than on the GPU. The listed reasons are:

  1. This prevents the data from going from CPU to GPU to CPU to GPU back again.
  2. This frees the GPU of these tasks to focus on training.

I am not sure I understand either argument.

  1. Why would preprocessing send the result back to the CPU, especially if all nodes are on the GPU? Why preprocessing operations and not any other operation on the graph? Why are they, or why should they be, special?
  2. Even though I understand the rationale behind putting the CPU to work rather than keeping it idle, I would have assumed that, compared to the huge convolutions and gradient backpropagation a training step has to do, random cropping, flipping, and other standard preprocessing steps on the input images are nowhere near as demanding in terms of computation, and should execute in a fraction of the time. Even if we think of preprocessing as mostly moving things around (crops, flips), I would think GPU memory should be faster for that. Yet doing the preprocessing on the CPU can yield a 6+-fold increase in throughput according to the same guide.

I am assuming, of course, that preprocessing does not drastically decrease the size of the data (e.g. subsampling, or cropping to a much smaller size), in which case the gain in transfer time to the device is obvious. I suppose those are rather extreme cases and do not constitute the basis for the above recommendation.

Can somebody make sense out of this?

asked Jun 05 '17 by P-Gn



1 Answer

It is based on how the CPU and GPU work. The GPU is very good at repetitive, parallelizable tasks, whereas the CPU is better at everything else: general-purpose computation, control flow, and interaction with the outside world.

For example, consider a program that accepts two integers from the user and then runs a loop one million times to sum those two numbers.

How can we achieve this with a combination of CPU and GPU processing?

We do the initial data-intake part (reading the two integers from the user) on the CPU, then send the two numbers to the GPU, where the summing loop runs, because that is the repetitive, parallelizable, yet simple part of the computation that the GPU is better at. [Although this example is not really specific to TensorFlow, this concept is at the heart of all CPU and GPU processing. Regarding your query: preprocessing steps like random cropping, flipping, and other standard operations on input images might not be computationally intensive, but the GPU does not excel at that kind of branchy, interrupt-driven work either. A minimal sketch of this division of labor follows.]
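To make this concrete, here is a minimal TensorFlow 1.x-style sketch of explicit device placement. The million-element workload simply mirrors the toy example above, and a single GPU at '/gpu:0' is assumed; none of this is from the guide itself, it is only an illustration:

    import tensorflow as tf  # TF 1.x-style API; in TF 2.x use tf.compat.v1

    # The inputs arrive on the host (CPU), e.g. two numbers from the user.
    with tf.device('/cpu:0'):
        a = tf.placeholder(tf.float32, shape=[], name='a')
        b = tf.placeholder(tf.float32, shape=[], name='b')

    # The repetitive, parallelizable part is pinned to the GPU: here,
    # one million independent additions of the same two numbers.
    with tf.device('/gpu:0'):
        ones = tf.ones([1000000], dtype=tf.float32)
        sums = a * ones + b * ones  # element-wise ops over 1M elements

    # allow_soft_placement falls back to the CPU if no GPU is present.
    with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
        result = sess.run(sums, feed_dict={a: 2.0, b: 3.0})
        print(result[:5])  # [5. 5. 5. 5. 5.]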

Another thing to keep in mind is that the latency between the CPU and GPU also plays a key role here. Copying and transferring data back and forth between CPU and GPU is expensive compared to moving data between cache levels inside the CPU.

As Dey (2014) [1] mentions:

When a parallelized program is computed on the GPGPU, first the data is copied from memory to the GPU, and after computation the data is written back to memory from the GPU using the PCI-e bus (refer to Fig. 1.8). Thus for every computation, data has to be copied back and forth between device and host memory. Although the computation itself is very fast on the GPGPU, the gap between device and host memory, due to communication via PCI-e, creates a bottleneck in performance.

[Fig. 1.8 from Dey, 2014: data flow between host memory and the GPU over the PCI-e bus]
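To put rough numbers on this (a back-of-the-envelope illustration only; the batch size, image size, and bus figures are my assumptions, not from the guide): a batch of 256 RGB images at 224x224 in float32 is 256 x 224 x 224 x 3 x 4 bytes, which is about 154 MB. Over PCIe 3.0 x16, with a theoretical peak of roughly 16 GB/s, that is on the order of 10 ms per transfer direction. If preprocessing on the GPU bounces the batch CPU -> GPU -> CPU -> GPU, you pay that cost several times per step, while on-device GPU memory bandwidth is an order of magnitude higher (hundreds of GB/s on contemporary cards).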

For this reason it is advisable that:

You do the preprocessing on the CPU: the CPU does the initial computation, prepares the data, and sends the repetitive, parallelizable tasks to the GPU for further processing.

I once developed a buffer mechanism to increase the data throughput between CPU and GPU and thereby reduce the negative effects of this latency. Have a look at this thesis to gain a better understanding of the issue:

EFFICIENT DATA INPUT/OUTPUT (I/O) FOR FINITE DIFFERENCE TIME DOMAIN (FDTD) COMPUTATION ON GRAPHICS PROCESSING UNIT (GPU)

Now, to answer your question:

Why would preprocessing send the result back to the CPU, especially if all nodes are on the GPU?

As quoted from the TensorFlow performance guide [2]:

When preprocessing occurs on the GPU the flow of data is CPU -> GPU (preprocessing) -> CPU -> GPU (training). The data is bounced back and forth between the CPU and GPU.

If you recall the CPU-memory-GPU dataflow described above, doing the preprocessing on the CPU improves performance because:

  • After nodes are computed on the GPU, the data is sent back to main memory, and the CPU fetches it from there for further processing. The GPU does not have enough on-board memory to keep all of the data on it for computational purposes, so some back-and-forth of data is inevitable. To optimize this data flow, you do the preprocessing on the CPU; then the data prepared for the parallelizable training tasks is sent to memory, and the GPU fetches that preprocessed data and works on it (see the sketch below).
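A minimal sketch of that recommended flow, in the TF 1.x idiom the guide uses. The file name train.tfrecord, the single-JPEG record layout, and the toy conv layer are hypothetical placeholders, not from the guide:

    import tensorflow as tf  # TF 1.x-style API

    def parse_and_preprocess(serialized):
        # Hypothetical TFRecord layout: one JPEG-encoded image per example.
        features = tf.parse_single_example(
            serialized, {'image': tf.FixedLenFeature([], tf.string)})
        image = tf.image.decode_jpeg(features['image'], channels=3)
        image = tf.image.resize_image_with_crop_or_pad(image, 256, 256)
        image = tf.random_crop(image, [224, 224, 3])    # random crop
        image = tf.image.random_flip_left_right(image)  # random flip
        return tf.image.convert_image_dtype(image, tf.float32)

    # Pin the entire input pipeline to the CPU, as the guide recommends,
    # so data flows CPU (preprocess) -> GPU (train) without bouncing back.
    with tf.device('/cpu:0'):
        dataset = tf.data.TFRecordDataset(['train.tfrecord'])  # hypothetical path
        dataset = dataset.map(parse_and_preprocess, num_parallel_calls=4)
        dataset = dataset.batch(32).prefetch(1)  # overlap CPU prep with GPU steps
        images = dataset.make_one_shot_iterator().get_next()

    # The heavy lifting (convolutions, backprop) stays on the GPU.
    with tf.device('/gpu:0'):
        conv = tf.layers.conv2d(images, filters=64, kernel_size=3)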

The performance guide itself also mentions that by doing this, and by having an efficient input pipeline, you will starve neither the CPU nor the GPU, which supports the logic above. Again, in the same performance doc, you will also see the following:

If your training loop runs faster when using SSDs vs HDDs for storing your input data, you could be I/O bottlenecked. If this is the case, you should pre-process your input data, creating a few large TFRecord files.

This again points at the same CPU-memory-GPU performance bottleneck mentioned above.
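For completeness, a minimal sketch of that suggestion: writing raw JPEG-encoded examples into one large TFRecord file. The function name, feature key, and file path are hypothetical; the API is TF 1.x:

    import tensorflow as tf  # TF 1.x-style API

    # 'images' is assumed to be a list of JPEG-encoded byte strings.
    def write_tfrecord(images, path='train.tfrecord'):  # hypothetical path
        with tf.python_io.TFRecordWriter(path) as writer:
            for img_bytes in images:
                # One Example per image, under the hypothetical key 'image'.
                example = tf.train.Example(features=tf.train.Features(feature={
                    'image': tf.train.Feature(
                        bytes_list=tf.train.BytesList(value=[img_bytes])),
                }))
                writer.write(example.SerializeToString())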

Hope this helps; if you need more clarification on CPU-GPU performance, do not hesitate to drop a message!

References:

[1] Somdip Dey, EFFICIENT DATA INPUT/OUTPUT (I/O) FOR FINITE DIFFERENCE TIME DOMAIN (FDTD) COMPUTATION ON GRAPHICS PROCESSING UNIT (GPU), 2014

[2] Tensorflow Performance Guide: https://www.tensorflow.org/performance/performance_guide

answered Sep 20 '22 by Somdip Dey