For my CUDA development, I am using a machine with 16 cores and one GTX 580 GPU with 16 SMs. For the work that I am doing, I plan to launch 16 host threads (one on each core), with one kernel launch per thread, each using 1 block and 1024 threads. My goal is to run 16 kernels in parallel on the 16 SMs. Is this possible/feasible?
I have tried to read as much as possible about independent contexts, but there does not seem to be much information available. As I understand it, each host thread can have its own GPU context, but I am not sure whether the kernels will run in parallel if I use independent contexts.
I could read all the data from all 16 host threads into one giant structure and pass it to the GPU to launch a single kernel, but that would require too much copying and would slow the application down.
In order to launch a CUDA kernel, we need to specify the block dimension and the grid dimension from the host code. I'll consider the same Hello World! code used in the previous article. To launch the CUDA kernel, two 1's are placed between the triple angle brackets, as in the sketch below.
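A minimal sketch of such a launch (the kernel name and message are illustrative, not taken from the original article):

```cuda
#include <cstdio>

// __global__ marks a function that runs on the device but is launched from the host.
__global__ void helloFromGPU()
{
    printf("Hello World! from the GPU\n");
}

int main()
{
    // The two 1's between the triple angle brackets are the grid dimension
    // (1 block) and the block dimension (1 thread per block), respectively.
    helloFromGPU<<<1, 1>>>();

    // Wait for the kernel to finish so its printf output is flushed.
    cudaDeviceSynchronize();
    return 0;
}
```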
NVIDIA will present “CUDA Multithreading with Streams” to OLCF and NERSC users on Friday, July 16, 2021, as a continuation of the CUDA Training Series. CUDA streams are a useful way to achieve concurrency and ensure that an application fully utilizes the GPU.
__global__: a qualifier added to standard C. It alerts the compiler that a function should be compiled to run on the device (GPU) instead of the host (CPU).
A group of threads is called a CUDA block. CUDA blocks are grouped into a grid. A kernel is executed as a grid of blocks of threads (Figure 2). Each CUDA block is executed by one streaming multiprocessor (SM) and cannot be migrated to other SMs in the GPU (except during preemption, debugging, or CUDA dynamic parallelism).
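To make the __global__ qualifier and the grid/block structure concrete, here is a minimal, hypothetical sketch: each thread derives a global index from its block and thread coordinates, and each 1024-thread block is scheduled onto a single SM.

```cuda
#include <cstdio>

// Illustrative only: every thread handles one element of the array.
__global__ void scaleArray(float *data, float factor, int n)
{
    // Global index = block offset + thread offset within the block.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // Grid of blocks of threads: 1024 threads per block, enough blocks to cover n.
    int threadsPerBlock = 1024;
    int blocksPerGrid   = (n + threadsPerBlock - 1) / threadsPerBlock;
    scaleArray<<<blocksPerGrid, threadsPerBlock>>>(d_data, 2.0f, n);

    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```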
You can only have one context active on a GPU at a time. One way to achieve the sort of parallelism you require would be to use CUDA streams. You can create 16 streams inside the context and launch memcopies and kernels into streams by name, as in the sketch below. You can read more in a quick webinar on using streams at http://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf. The full API reference is in the CUDA Toolkit manuals; the CUDA 4.2 manual is available at http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_Toolkit_Reference_Manual.pdf.
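A rough sketch of that approach, assuming a hypothetical processChunk kernel and 16 independent data chunks: one context, 16 streams, and one 1-block/1024-thread launch per stream, mirroring the setup described in the question.

```cuda
#include <cuda_runtime.h>

// Hypothetical per-chunk kernel: 1 block of 1024 threads working on its own slice.
__global__ void processChunk(float *chunk, int n)
{
    int i = threadIdx.x;
    if (i < n)
        chunk[i] += 1.0f;
}

int main()
{
    const int numStreams = 16;
    const int chunkSize  = 1024;

    float       *d_chunks[numStreams];
    cudaStream_t streams[numStreams];

    for (int s = 0; s < numStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&d_chunks[s], chunkSize * sizeof(float));
    }

    // Launch one 1-block / 1024-thread kernel into each stream; with only one
    // block per launch, the hardware is free to overlap the kernels across SMs.
    for (int s = 0; s < numStreams; ++s) {
        processChunk<<<1, chunkSize, 0, streams[s]>>>(d_chunks[s], chunkSize);
    }

    cudaDeviceSynchronize();

    for (int s = 0; s < numStreams; ++s) {
        cudaStreamDestroy(streams[s]);
        cudaFree(d_chunks[s]);
    }
    return 0;
}
```

Whether the launches actually overlap depends on resource availability, but Fermi-class hardware such as the GTX 580 does support concurrent kernel execution across streams.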
While a multi-threaded application can hold multiple CUDA contexts simultaneously on the same GPU, those contexts cannot perform operations concurrently. When active, each context has sole use of the GPU, and must yield before another context (which could include operations with a rendering API or a display manager) can have access to the GPU.
So, in a word: no, this strategy can't work with any current CUDA version or hardware.