Performance penalty when invoking a cuda kernel

Question

I'm wondering what the overhead of a CUDA kernel call is in C/C++, such as the following:

somekernel1<<<blocks,threads>>>(args);
somekernel2<<<blocks,threads>>>(args);
somekernel3<<<blocks,threads>>>(args);

I am asking because the application I am building currently makes repeated calls into several kernels (without memory being re-read from or written to the device between calls), and I'm wondering if wrapping these kernel calls into a single kernel call (with somekernel1-3 becoming device functions) would make any meaningful difference in performance.

talonmies · Accepted Answer

The host side overhead of a kernel launch using the runtime API is only about 15-30 microseconds on non-WDDM Windows platforms. On WDDM platforms (which I don't use), I understand it can be much, much higher, plus there is some sort of batching mechanism in the driver which tries to amortise the cost by doing multiple operations in a single driver side operation.
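One rough way to check that figure on your own machine is to time many back-to-back launches of a kernel that does nothing. The following is a minimal sketch, not a rigorous benchmark: `emptyKernel` and the launch count are illustrative assumptions, and the number it prints is an average cost per launch for a stream of queued launches, which mixes host side and device side effects and will vary with driver, platform and GPU.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel that does no work, so the measured time is
// dominated by launch cost rather than by computation.
__global__ void emptyKernel() {}

int main() {
    const int launches = 10000;  // arbitrary choice for this sketch

    // Warm-up launch so one-time context creation is not included.
    emptyKernel<<<1, 1>>>();
    if (cudaDeviceSynchronize() != cudaSuccess) {
        std::printf("warm-up launch failed\n");
        return 1;
    }

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < launches; ++i) {
        emptyKernel<<<1, 1>>>();  // asynchronous: returns once queued
    }
    // Wait for every queued launch to finish before stopping the clock.
    cudaError_t err = cudaDeviceSynchronize();
    auto stop = std::chrono::steady_clock::now();

    if (err != cudaSuccess) {
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    double totalUs =
        std::chrono::duration<double, std::micro>(stop - start).count();
    std::printf("average time per launch: %.2f microseconds\n",
                totalUs / launches);
    return 0;
}
```

If your real kernels each run for much longer than the value printed here, launch overhead is unlikely to be what limits performance.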

Generally, there will be a performance increase in "fusing" multiple data operations which would otherwise be done in separate kernels into a single kernel, where the algorithms allow it. The GPU has much higher arithmetic peak performance than peak memory bandwidth, so the more FLOPs which can be executed per memory transaction (and per kernel "setup code"), the better the performance of the kernel will be. On the other hand, trying to write a "swiss army knife" style kernel which tries to cram completely disparate operations into a single piece of code is never a particularly good idea, because it increases register pressure and reduces the efficiency of things like L1, constant memory and texture caches.
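To make the fusion idea concrete, here is a small sketch under assumed operations. The three steps (`stepScale`, `stepShift`, `stepSquare`) are hypothetical stand-ins for somekernel1-3 and are simple elementwise operations, which is the case where fusion is straightforward. The separate path reads and writes global memory three times per element; the fused path does so once and keeps the intermediate values in registers.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical per-element steps, standing in for somekernel1-3.
__device__ float stepScale(float x)  { return 2.0f * x; }
__device__ float stepShift(float x)  { return x + 1.0f; }
__device__ float stepSquare(float x) { return x * x; }

// Separate kernels: each one loads from and stores to global memory.
__global__ void scaleKernel(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = stepScale(d[i]);
}
__global__ void shiftKernel(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = stepShift(d[i]);
}
__global__ void squareKernel(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = stepSquare(d[i]);
}

// Fused kernel: one load, three operations in registers, one store.
__global__ void fusedKernel(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = stepSquare(stepShift(stepScale(d[i])));
}

int main() {
    const int n = 1 << 20;
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    const size_t bytes = n * sizeof(float);

    std::vector<float> input(n), separate(n), fused(n);
    for (int i = 0; i < n; ++i) input[i] = 0.5f * i;

    float *dA = nullptr, *dB = nullptr;
    if (cudaMalloc(&dA, bytes) != cudaSuccess ||
        cudaMalloc(&dB, bytes) != cudaSuccess) {
        std::printf("cudaMalloc failed\n");
        return 1;
    }
    cudaMemcpy(dA, input.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, input.data(), bytes, cudaMemcpyHostToDevice);

    // Path 1: three launches, data stays on the device between them.
    scaleKernel<<<blocks, threads>>>(dA, n);
    shiftKernel<<<blocks, threads>>>(dA, n);
    squareKernel<<<blocks, threads>>>(dA, n);

    // Path 2: a single launch doing the same work.
    fusedKernel<<<blocks, threads>>>(dB, n);

    cudaMemcpy(separate.data(), dA, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(fused.data(), dB, bytes, cudaMemcpyDeviceToHost);
    cudaError_t err = cudaGetLastError();
    cudaFree(dA);
    cudaFree(dB);
    if (err != cudaSuccess) {
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Check that both paths agree to within a small relative tolerance.
    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        float tol = 1e-5f * std::fabs(separate[i]) + 1e-5f;
        if (std::fabs(separate[i] - fused[i]) > tol) ++mismatches;
    }
    std::printf("mismatches between separate and fused: %d\n", mismatches);
    return mismatches == 0 ? 0 : 1;
}
```

This only works so directly because each thread touches just its own element. If a later step needs results produced by other threads or blocks, the kernel boundary is acting as a grid-wide synchronisation point and the steps cannot simply be chained inside one kernel like this.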

Which way you choose to go should really be guided by the nature of the code/algorithms. I don't believe there is a single "correct" answer to this question that can be applied in all circumstances.

Caian · Answer

If you are using Visual Studio Pro on Windows, I suggest you run a test application using NVIDIA's Parallel Nsight. I think it can show you the time stamps from the method call to the actual execution. In any case, some penalty is inherent, but it will be negligible if your kernels run for long enough.

Tags:

c++

c

cuda

NothingMore
