
Is it possible to have a persistent CUDA kernel running and communicating with the CPU asynchronously?

Tags: c++, c, cpu, cuda, gpu

As far as I know, CUDA streams and cudaMemcpyAsync require us to assign different kernels and memory operations to different streams in order to make GPU operations concurrent with CPU operations.

But is it possible to have one persistent kernel? This kernel would launch once and loop forever, checking some flags to see whether a piece of data has arrived from the CPU, then operating on it. When the GPU finishes processing that piece of data, it sets a flag for the CPU; the CPU sees it and copies the data back. This kernel would never finish running.

Does this exist in the current CUDA programming model? What is the closest to this that I can get?

yidiyidawu asked Feb 28 '14 20:02





1 Answer

Yes, it's possible. One approach is to use zero-copy (i.e. GPU-mapped) host memory. The host places its data in the mapped area, and the GPU communicates back through the same mapped area. Obviously this requires polling, but that is inherent in your question.
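To make the zero-copy idea concrete, here is a minimal sketch, not the code from the linked answers. The `Mailbox` struct, the sequence-number handshake, and the names `persistent_kernel` and `run_persistent_demo` are all assumptions made up for illustration. It uses a single polling thread and doubles an integer as a stand-in for real work.

```cuda
#include <atomic>
#include <cuda_runtime.h>

// Hypothetical mailbox living in mapped (zero-copy) host memory.
// Every field has exactly one writer, so host and device never race on a store.
struct Mailbox {
    volatile unsigned request_seq;   // host only: bumped when `input` is ready
    volatile unsigned response_seq;  // device only: set when `output` is ready
    volatile int quit;               // host only: asks the kernel to exit
    volatile int input;              // host only
    volatile int output;             // device only
};

// Launched once; loops until the host sets `quit`.
__global__ void persistent_kernel(Mailbox* box) {
    unsigned served = 0;
    while (!box->quit) {
        unsigned req = box->request_seq;
        if (req != served) {
            __threadfence_system();      // order the flag read before the data read
            int v = box->input;
            box->output = 2 * v;         // stand-in for real work
            __threadfence_system();      // publish `output` before the flag
            served = req;
            box->response_seq = served;
        }
    }
}

// Runs `rounds` request/response round trips against the persistent kernel.
// Returns the number of correct results, or -1 on a setup error.
int run_persistent_demo(int rounds) {
    Mailbox* h_box = nullptr;
    Mailbox* d_box = nullptr;

    // May be required on older setups; the result is ignored because the call
    // can fail harmlessly when a context is already active.
    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaGetLastError();

    if (cudaHostAlloc((void**)&h_box, sizeof(Mailbox), cudaHostAllocMapped) != cudaSuccess)
        return -1;
    h_box->request_seq = 0;
    h_box->response_seq = 0;
    h_box->quit = 0;
    h_box->input = 0;
    h_box->output = 0;

    if (cudaHostGetDevicePointer((void**)&d_box, (void*)h_box, 0) != cudaSuccess) {
        cudaFreeHost(h_box);
        return -1;
    }

    persistent_kernel<<<1, 1>>>(d_box);  // asynchronous: the host continues immediately
    if (cudaGetLastError() != cudaSuccess) {
        cudaFreeHost(h_box);
        return -1;
    }
    cudaStreamQuery(0);  // may nudge drivers that batch launches to submit the kernel

    int correct = 0;
    for (int i = 1; i <= rounds; ++i) {
        h_box->input = i;
        std::atomic_thread_fence(std::memory_order_seq_cst);  // data before flag
        h_box->request_seq = static_cast<unsigned>(i);
        while (h_box->response_seq != static_cast<unsigned>(i)) { /* poll */ }
        std::atomic_thread_fence(std::memory_order_seq_cst);  // flag before data
        if (h_box->output == 2 * i) ++correct;
    }

    h_box->quit = 1;           // let the kernel fall out of its loop
    cudaDeviceSynchronize();
    cudaFreeHost(h_box);
    return correct;
}
```

Calling `run_persistent_demo(4)` from a `main` function should return 4 if every round trip succeeds. Sequence numbers are used instead of a single set-and-clear flag so that neither side ever overwrites a value the other side wrote.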

This answer gives you most of the plumbing you need for a simple test case.

There is also the simple zero-copy sample code.

This answer provides a more involved, fully worked example.

Naturally, you'd want to do this in an environment where there are no timeout watchdogs enabled.
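As a rough pre-flight check for the watchdog concern above, the runtime exposes device attributes that can be queried before launching anything long-running. The helper name `suits_persistent_kernel` is hypothetical; this is a sketch, not a guarantee that a never-ending kernel will be tolerated on a given system.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: returns true when device `dev` reports that it can map
// host memory and that no kernel execution timeout (watchdog) is enabled.
// Returns false if either attribute query fails (e.g. invalid device index).
bool suits_persistent_kernel(int dev) {
    int timeout_enabled = 0;
    int can_map_host = 0;
    if (cudaDeviceGetAttribute(&timeout_enabled, cudaDevAttrKernelExecTimeout, dev) != cudaSuccess)
        return false;
    if (cudaDeviceGetAttribute(&can_map_host, cudaDevAttrCanMapHostMemory, dev) != cudaSuccess)
        return false;
    return can_map_host != 0 && timeout_enabled == 0;
}
```

A GPU that also drives a display commonly reports the timeout as enabled, in which case a kernel that loops forever is likely to be killed by the driver.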

Robert Crovella answered Nov 06 '22 05:11
