The performance guide advises doing the preprocessing on the CPU rather than on the GPU, and lists a couple of reasons for this. I am not sure I understand either argument.
I am assuming, of course, that preprocessing does not drastically reduce the size of the data (e.g. subsampling, or cropping to a much smaller size), in which case the gain in transfer time to the device is obvious. I suppose these are rather extreme cases that do not constitute the basis for the above recommendation.
Can somebody make sense out of this?
When it comes to data analytics, GPUs can handle many tasks at once thanks to their massive parallelism. CPUs, however, are more versatile in the tasks they can perform, since a GPU's applicability is usually limited to this kind of data crunching. Different processing units are best suited to different tasks.
High data throughput: a GPU consists of hundreds of cores performing the same operation on multiple data items in parallel, so it can push vast volumes of data through a workload at speeds a CPU cannot match.
Massive parallelism: those hundreds of cores make the GPU well suited to massively parallel calculations, such as matrix multiplications.
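As an illustration of both points, here is a minimal sketch (assuming TensorFlow 1.x, the version the quoted performance guide targets) that pins the same matrix multiplication to the CPU and to the GPU; with soft placement enabled it still runs on a machine without a GPU:

```python
import tensorflow as tf  # assuming TensorFlow 1.x, the version the guide targets

# The same matrix multiplication pinned to each device. With soft placement
# enabled, the '/gpu:0' op falls back to the CPU on machines without a GPU.
a = tf.random_normal([2000, 2000])
b = tf.random_normal([2000, 2000])

with tf.device('/cpu:0'):
    cpu_product = tf.matmul(a, b)

with tf.device('/gpu:0'):
    gpu_product = tf.matmul(a, b)

config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)
with tf.Session(config=config) as sess:
    sess.run([cpu_product, gpu_product])  # placements are logged to stderr
```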
GPUs are designed to process graphics instructions much faster than CPUs, which is why most game consoles rely on GPUs rather than CPUs for rendering images. This speed comes at a price, though: GPUs also consume more power.
The recommendation follows from how CPUs and GPUs work. A GPU does repetitive, parallelisable tasks very well, whereas a CPU is better at computations that require more general processing capability.
For example, consider a program that accepts two integers from the user and runs a for-loop one million times to sum the two numbers.
How can we achieve this with a combination of CPU and GPU processing?
We handle the initial data (the two user-input integers) on the CPU, then send the two numbers to the GPU, where the for-loop that sums them runs, because that is the repetitive, parallelisable yet simple part of the computation that the GPU is better at (sketched below). (This example is not strictly about TensorFlow, but the concept is at the heart of all CPU and GPU processing. As for your query: preprocessing steps such as random cropping, flipping and other standard operations on input images may not be computationally intensive, but the GPU does not excel at that kind of interrupt-related computation either.)
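Here is a minimal, hypothetical sketch of that division of labour in TensorFlow (again assuming 1.x; the million-element fill is just one way to express the repeated sum, not the author's code):

```python
import tensorflow as tf  # assuming TensorFlow 1.x; a hypothetical sketch

with tf.device('/cpu:0'):
    a = tf.placeholder(tf.float32, name='a')  # input handling stays on the CPU
    b = tf.placeholder(tf.float32, name='b')

with tf.device('/gpu:0'):
    # one million copies of a + b, reduced in parallel: the repetitive,
    # simple part of the computation that the GPU is better at
    total = tf.reduce_sum(tf.fill([1000000], a + b))

config = tf.ConfigProto(allow_soft_placement=True)
with tf.Session(config=config) as sess:
    print(sess.run(total, feed_dict={a: 2.0, b: 3.0}))  # 5.0 * 1,000,000
```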
Another thing to keep in mind is that the latency between the CPU and the GPU plays a key role here. Copying data back and forth between the CPU and the GPU is expensive compared with moving data between cache levels inside the CPU.
As Dey (2014) [1] mentions:
When a parallelized program is computed on the GPGPU, the data is first copied from memory to the GPU and, after computation, written back to memory from the GPU over the PCI-e bus (refer to Fig. 1.8 there). Thus for every computation, data has to be copied to and from device and host memory. Although the computation itself is very fast on the GPGPU, the gap between device and host memory, bridged only by PCI-e communication, creates a performance bottleneck.
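To make the transfer cost concrete, here is a small, hedged sketch (TensorFlow 1.x, a GPU assumed to be available) comparing fetching a full tensor back to the host with fetching only a scalar reduction of it:

```python
import tensorflow as tf  # assuming TensorFlow 1.x, as in the quoted guide
import time

# Hedged sketch: y is 4000 x 4000 float32, i.e. ~64 MB, so fetching it
# forces a device-to-host copy over PCI-e, while fetching the scalar sum
# copies only 4 bytes. On a CPU-only machine the gap instead reflects
# host memory copies.
x = tf.random_normal([4000, 4000])
y = tf.matmul(x, x)
total = tf.reduce_sum(y)

with tf.Session() as sess:
    sess.run(total)  # warm-up so the timings exclude one-off setup costs
    t0 = time.time(); sess.run(y); t1 = time.time()
    sess.run(total); t2 = time.time()
    print('fetch full tensor: %.3fs, fetch scalar: %.3fs' % (t1 - t0, t2 - t1))
```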
For this reason it is advisable that you do the preprocessing on the CPU: the CPU performs the initial computation, prepares the data, and sends the repetitive, parallelised tasks on to the GPU for further processing.
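In TensorFlow this typically means pinning the input pipeline to the CPU and prefetching, so the GPU is handed ready batches. A minimal sketch, assuming TensorFlow 1.x; parse_and_augment is a hypothetical stand-in for the random crops and flips mentioned above:

```python
import tensorflow as tf  # assuming TensorFlow 1.x and a tf.data pipeline

# Hypothetical parse_and_augment: stands in for the random crops/flips
# discussed above; not taken from the performance guide itself.
def parse_and_augment(image):
    image = tf.image.random_flip_left_right(image)
    return tf.random_crop(image, [224, 224, 3])

with tf.device('/cpu:0'):  # keep the whole input pipeline on the CPU
    images = tf.random_normal([100, 256, 256, 3])  # stand-in for real data
    dataset = (tf.data.Dataset.from_tensor_slices(images)
               .map(parse_and_augment, num_parallel_calls=4)
               .batch(32)
               .prefetch(1))  # overlap CPU preprocessing with GPU training
    next_batch = dataset.make_one_shot_iterator().get_next()

# next_batch is then consumed by training ops placed on the GPU.
```

The prefetch(1) at the end is what lets the CPU prepare the next batch while the GPU trains on the current one, which is how the pipeline avoids starving either device.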
I once developed a buffer mechanism to increase the data throughput between the CPU and the GPU and thereby reduce the negative effects of this latency. Have a look at this thesis to gain a better understanding of the issue:
Efficient Data Input/Output (I/O) for Finite Difference Time Domain (FDTD) Computation on Graphics Processing Unit (GPU)
Now, to answer your question:
Why would preprocessing send the result back to the CPU, esp. if all nodes are on GPU?
As quoted from the TensorFlow performance guide [2]:
When preprocessing occurs on the GPU the flow of data is CPU -> GPU (preprocessing) -> CPU -> GPU (training). The data is bounced back and forth between the CPU and GPU.
If you recall the CPU-memory-GPU dataflow described above, the reason preprocessing on the CPU improves performance is clear: it avoids bouncing the data back and forth over the PCI-e bus between host and device.
The performance guide also mentions that by doing this, together with an efficient input pipeline, you starve neither the CPU nor the GPU, which supports the same logic. In the same document you will also see:
If your training loop runs faster when using SSDs vs HDDs for storing your input data, you could be I/O bottlenecked. If this is the case, you should pre-process your input data, creating a few large TFRecord files.
This again points to the same CPU-memory-GPU performance bottleneck described above.
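For completeness, here is a minimal, hedged sketch of what "creating a few large TFRecord files" can look like (TensorFlow 1.x; the file name and feature keys are illustrative, not prescribed by the guide):

```python
import tensorflow as tf  # assuming TensorFlow 1.x

# Illustrative only: the file name and feature keys are my own choices, not
# prescribed by the guide. Each preprocessed example is serialized once, so
# training later reads a few large files sequentially instead of many
# small ones.
def make_example(image_bytes, label):
    return tf.train.Example(features=tf.train.Features(feature={
        'image': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes])),
        'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    }))

with tf.python_io.TFRecordWriter('train-00000-of-00001.tfrecord') as writer:
    for image_bytes, label in [(b'\x00' * 16, 0), (b'\xff' * 16, 1)]:  # toy data
        writer.write(make_example(image_bytes, label).SerializeToString())
```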
Hope this helps; if you need more clarification on CPU-GPU performance, do not hesitate to drop a message!
References:
[1] Somdip Dey, Efficient Data Input/Output (I/O) for Finite Difference Time Domain (FDTD) Computation on Graphics Processing Unit (GPU), thesis, 2014.
[2] TensorFlow Performance Guide: https://www.tensorflow.org/performance/performance_guide