 

Why should preprocessing be done on CPU rather than GPU?

Tags:

tensorflow

The performance guide advises doing the preprocessing on the CPU rather than on the GPU. The listed reasons are:

  1. This prevents the data from going from CPU to GPU to CPU to GPU back again.
  2. This frees the GPU of these tasks to focus on training.

I am not sure I understand either argument.

  1. Why would preprocessing send the result back to the CPU, especially if all nodes are on the GPU? Why preprocessing operations and not any other operation on the graph? Why are they, or why should they be, special?
  2. Even though I understand the rationale behind putting the CPU to work rather than keeping it idle, I would have assumed that, compared to the huge convolutions and gradient backpropagation a training step has to do, random cropping, flipping, and other standard preprocessing steps on the input images are nowhere near as demanding in terms of computation, and should execute in a fraction of the time. Even if we think of preprocessing as mostly moving things around (crops, flips), I would think GPU memory should be faster for that. Yet doing the preprocessing on the CPU can yield a 6+-fold increase in throughput according to the same guide.

I am assuming, of course, that preprocessing does not drastically decrease the size of the data (e.g. subsampling, or cropping to a much smaller size), in which case the gain in transfer time to the device is obvious. I suppose those are rather extreme cases and do not constitute the basis for the above recommendation.

Can somebody make sense out of this?

asked Jun 05 '17 by P-Gn



1 Answer

It is based on how the CPU and GPU work. The GPU is very good at repetitive, parallelizable tasks, whereas the CPU is better at everything else: general-purpose computation, control flow, and interaction with the outside world.

For example, consider a program that accepts two integers from the user and then runs a loop one million times to sum those two numbers.

How can we achieve this with a combination of CPU and GPU processing?

We do the initial data-intake part (reading the two integers from the user) on the CPU, then send the two numbers to the GPU, where the summing loop runs, because that is the repetitive, parallelizable, yet simple part of the computation that the GPU is better at. [Although this example is not really specific to TensorFlow, this concept is at the heart of all CPU and GPU processing. Regarding your query: preprocessing steps like random cropping, flipping, and other standard operations on input images might not be computationally intensive, but the GPU does not excel at that kind of branchy, interrupt-driven work either. A minimal sketch of this division of labor follows.]
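To make this concrete, here is a minimal TensorFlow 1.x-style sketch of explicit device placement. The million-element workload simply mirrors the toy example above, and a single GPU at '/gpu:0' is assumed; none of this is from the guide itself, it is only an illustration:

    import tensorflow as tf  # TF 1.x-style API; in TF 2.x use tf.compat.v1

    # The inputs arrive on the host (CPU), e.g. two numbers from the user.
    with tf.device('/cpu:0'):
        a = tf.placeholder(tf.float32, shape=[], name='a')
        b = tf.placeholder(tf.float32, shape=[], name='b')

    # The repetitive, parallelizable part is pinned to the GPU: here,
    # one million independent additions of the same two numbers.
    with tf.device('/gpu:0'):
        ones = tf.ones([1000000], dtype=tf.float32)
        sums = a * ones + b * ones  # element-wise ops over 1M elements

    # allow_soft_placement falls back to the CPU if no GPU is present.
    with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
        result = sess.run(sums, feed_dict={a: 2.0, b: 3.0})
        print(result[:5])  # [5. 5. 5. 5. 5.]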

Another thing to keep in mind is that the latency between the CPU and GPU also plays a key role here. Copying and transferring data back and forth between CPU and GPU is expensive compared to moving data between cache levels inside the CPU.

As Dey (2014) [1] mentions:

When a parallelized program is computed on the GPGPU, first the data is copied from memory to the GPU, and after computation the data is written back to memory from the GPU using the PCI-e bus (refer to Fig. 1.8). Thus for every computation, data has to be copied back and forth between device and host memory. Although the computation itself is very fast on the GPGPU, the gap between device and host memory, due to communication via PCI-e, creates a bottleneck in performance.

[Fig. 1.8 from Dey, 2014: data flow between host memory and the GPU over the PCI-e bus]
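To put rough numbers on this (a back-of-the-envelope illustration only; the batch size, image size, and bus figures are my assumptions, not from the guide): a batch of 256 RGB images at 224x224 in float32 is 256 x 224 x 224 x 3 x 4 bytes, which is about 154 MB. Over PCIe 3.0 x16, with a theoretical peak of roughly 16 GB/s, that is on the order of 10 ms per transfer direction. If preprocessing on the GPU bounces the batch CPU -> GPU -> CPU -> GPU, you pay that cost several times per step, while on-device GPU memory bandwidth is an order of magnitude higher (hundreds of GB/s on contemporary cards).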

For this reason it is advisable that:

You do the preprocessing on the CPU: the CPU does the initial computation, prepares the data, and sends the repetitive, parallelizable tasks to the GPU for further processing.

I once developed a buffer mechanism to increase the data throughput between CPU and GPU and thereby reduce the negative effects of this latency. Have a look at this thesis to gain a better understanding of the issue:

EFFICIENT DATA INPUT/OUTPUT (I/O) FOR FINITE DIFFERENCE TIME DOMAIN (FDTD) COMPUTATION ON GRAPHICS PROCESSING UNIT (GPU)

Now, to answer your question:

Why would preprocessing send the result back to the CPU, especially if all nodes are on the GPU?

As quoted from the TensorFlow performance guide [2]:

When preprocessing occurs on the GPU the flow of data is CPU -> GPU (preprocessing) -> CPU -> GPU (training). The data is bounced back and forth between the CPU and GPU.

If you recall the CPU-memory-GPU dataflow described above, doing the preprocessing on the CPU improves performance because:

  • After nodes are computed on the GPU, the data is sent back to main memory, and the CPU fetches it from there for further processing. The GPU does not have enough on-board memory to keep all of the data on it for computational purposes, so some back-and-forth of data is inevitable. To optimize this data flow, you do the preprocessing on the CPU; then the data prepared for the parallelizable training tasks is sent to memory, and the GPU fetches that preprocessed data and works on it (see the sketch below).
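A minimal sketch of that recommended flow, in the TF 1.x idiom the guide uses. The file name train.tfrecord, the single-JPEG record layout, and the toy conv layer are hypothetical placeholders, not from the guide:

    import tensorflow as tf  # TF 1.x-style API

    def parse_and_preprocess(serialized):
        # Hypothetical TFRecord layout: one JPEG-encoded image per example.
        features = tf.parse_single_example(
            serialized, {'image': tf.FixedLenFeature([], tf.string)})
        image = tf.image.decode_jpeg(features['image'], channels=3)
        image = tf.image.resize_image_with_crop_or_pad(image, 256, 256)
        image = tf.random_crop(image, [224, 224, 3])    # random crop
        image = tf.image.random_flip_left_right(image)  # random flip
        return tf.image.convert_image_dtype(image, tf.float32)

    # Pin the entire input pipeline to the CPU, as the guide recommends,
    # so data flows CPU (preprocess) -> GPU (train) without bouncing back.
    with tf.device('/cpu:0'):
        dataset = tf.data.TFRecordDataset(['train.tfrecord'])  # hypothetical path
        dataset = dataset.map(parse_and_preprocess, num_parallel_calls=4)
        dataset = dataset.batch(32).prefetch(1)  # overlap CPU prep with GPU steps
        images = dataset.make_one_shot_iterator().get_next()

    # The heavy lifting (convolutions, backprop) stays on the GPU.
    with tf.device('/gpu:0'):
        conv = tf.layers.conv2d(images, filters=64, kernel_size=3)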

The performance guide itself also mentions that by doing this, and by having an efficient input pipeline, you will starve neither the CPU nor the GPU, which supports the logic above. Again, in the same performance doc, you will also see the following:

If your training loop runs faster when using SSDs vs HDDs for storing your input data, you could be I/O bottlenecked. If this is the case, you should pre-process your input data, creating a few large TFRecord files.

This again points at the same CPU-memory-GPU performance bottleneck mentioned above.
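For completeness, a minimal sketch of that suggestion: writing raw JPEG-encoded examples into one large TFRecord file. The function name, feature key, and file path are hypothetical; the API is TF 1.x:

    import tensorflow as tf  # TF 1.x-style API

    # 'images' is assumed to be a list of JPEG-encoded byte strings.
    def write_tfrecord(images, path='train.tfrecord'):  # hypothetical path
        with tf.python_io.TFRecordWriter(path) as writer:
            for img_bytes in images:
                # One Example per image, under the hypothetical key 'image'.
                example = tf.train.Example(features=tf.train.Features(feature={
                    'image': tf.train.Feature(
                        bytes_list=tf.train.BytesList(value=[img_bytes])),
                }))
                writer.write(example.SerializeToString())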

Hope this helps; if you need more clarification on CPU-GPU performance, do not hesitate to drop a message!

References:

[1] Somdip Dey, EFFICIENT DATA INPUT/OUTPUT (I/O) FOR FINITE DIFFERENCE TIME DOMAIN (FDTD) COMPUTATION ON GRAPHICS PROCESSING UNIT (GPU), 2014

[2] Tensorflow Performance Guide: https://www.tensorflow.org/performance/performance_guide

answered Sep 20 '22 by Somdip Dey