Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Can we benchmark how fast CUDA or OpenCL is compared to CPU performance?

How much faster can an algorithm on CUDA or OpenCL code run compared to a general single processor core? (considering the algorithm is written and optimized for both the CPU and GPU target).

I know it depends on both the graphics card and the CPU, but say, one of the fastest GPUs of NVIDIA and a (single core of a) Intel i7 processor ?

And I know it also depends on the type of algorithm.

I do not need a strict answer, but experienced examples like: for a image manipulation algorithm using double-precision floating point and 10 operations per pixel took first 5 minutes and now runs in x seconds using this hardware.

like image 892
Roalt Avatar asked Nov 24 '10 15:11

Roalt


People also ask

How much faster is CUDA than OpenCL?

A study that directly compared CUDA programs with OpenCL on NVIDIA GPUs showed that CUDA was 30% faster than OpenCL.

Is CUDA faster than CPU?

Where for larger arrays, the CUDA outperforms the CPU by large margins. On a large scale, it looks like the CUDA times are not increasing, but if we only plot the CUDA times, we can see that it also increases linearly.

Is CUDA better than OpenCL?

Open-source vs commercial Generally if the app of your choice supports both CUDA and OpenCL, going with CUDA is the best option as it generates better performance results in this scenario. This is because NVIDIA provides top quality support.

Is OpenCL faster?

The OpenCL package has a nice test set that compares its own output against the reference project. On the two simplest test cases, OpenCL runs about 14 and 24 times as fast as on the CPU.


1 Answers

Your question is overly broad, and very difficult to answer. Moreover only a small percentage of algorithms (the ones that deal without much shared state) are feasable with GPUs.

But I do want to urge you to be critical about claims. I'm in imageprocessing, and read many an article on the subject, but quite often in the GPU case, the time to upload input data to the GPU, and download the results back to main memory is not included in the calculation of the factor.

While there are a few cases where this doesn't matter (both are small or there is a second stage calculation that further reduces the result in size), usually one does have to transfer the results and initial data.

I've seen this turning a claimed plus into a negative, because the upload/download time alone was longer than the main CPU would require to do the calculation.

Pretty much the same thing applies to combining results of different GPU cards.

Update Newer GPUs seem to be able to upload/download and calculate at the same time using ping-pong buffers. But the advise to check the border conditions thoroughly still stands. There is a lot of spin out there.

Update 2 Quite often using a GPU that is shared with video output for this is not optimal. Consider e.g. adding a low budget card for video, and using the onboard video for GPGPU tasks

like image 151
Marco van de Voort Avatar answered Oct 02 '22 19:10

Marco van de Voort