Multiple host threads launching individual CUDA kernels

For my CUDA development, I am using a machine with 16 cores and one GTX 580 GPU with 16 SMs. For the work I am doing, I plan to launch 16 host threads (one per core), each making one kernel launch with 1 block of 1024 threads. My goal is to run 16 kernels in parallel on the 16 SMs. Is this possible/feasible?

I have tried to read as much as possible about independent contexts, but there does not seem to be much information available. As I understand it, each host thread can have its own GPU context, but I am not sure whether the kernels will run in parallel if I use independent contexts.

I could read all the data from all 16 host threads into one giant structure, pass it to the GPU, and launch a single kernel. However, that would require too much copying and it would slow down the application.
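Roughly, this is the setup I have in mind (a sketch only; the kernel workKernel, the dummy data, and the use of std::thread stand in for my real code):

    #include <thread>
    #include <vector>
    #include <cuda_runtime.h>

    __global__ void workKernel(float *data)
    {
        int i = threadIdx.x;                 // 1 block of 1024 threads
        data[i] += 1.0f;                     // stand-in for the real work
    }

    void hostThread()
    {
        float *d_data;
        cudaMalloc(&d_data, 1024 * sizeof(float));
        cudaMemset(d_data, 0, 1024 * sizeof(float));
        workKernel<<<1, 1024>>>(d_data);     // one kernel launch per host thread
        cudaDeviceSynchronize();
        cudaFree(d_data);
    }

    int main()
    {
        std::vector<std::thread> threads;
        for (int i = 0; i < 16; ++i)         // one host thread per core
            threads.emplace_back(hostThread);
        for (auto &t : threads)
            t.join();
        return 0;
    }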

asked Sep 06 '12 by gmemon



2 Answers

Only one context can be active on a GPU at a time. One way to achieve the sort of parallelism you require is to use CUDA streams. You can create 16 streams inside the context and issue memcopies and kernel launches into each stream by name. There is a quick webinar on using streams at http://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf. The full API reference is in the CUDA toolkit manuals; the CUDA 4.2 manual is available at http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_Toolkit_Reference_Manual.pdf.
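For example, a minimal sketch of the streams approach (the kernel myKernel, the buffer size, and the doubling work are placeholders, not part of the question):

    #include <cuda_runtime.h>

    #define NUM_STREAMS 16
    #define N 1024

    __global__ void myKernel(float *data)
    {
        int i = threadIdx.x;                 // 1 block of 1024 threads per launch
        data[i] *= 2.0f;                     // stand-in for the real work
    }

    int main()
    {
        float *h_data[NUM_STREAMS], *d_data[NUM_STREAMS];
        cudaStream_t streams[NUM_STREAMS];

        for (int s = 0; s < NUM_STREAMS; ++s) {
            cudaStreamCreate(&streams[s]);
            cudaMallocHost(&h_data[s], N * sizeof(float));  // pinned host memory,
            cudaMalloc(&d_data[s], N * sizeof(float));      // required for async copies
        }

        // Issue copy-kernel-copy chains into 16 independent streams; work in
        // different streams is free to overlap on the device.
        for (int s = 0; s < NUM_STREAMS; ++s) {
            cudaMemcpyAsync(d_data[s], h_data[s], N * sizeof(float),
                            cudaMemcpyHostToDevice, streams[s]);
            myKernel<<<1, 1024, 0, streams[s]>>>(d_data[s]);
            cudaMemcpyAsync(h_data[s], d_data[s], N * sizeof(float),
                            cudaMemcpyDeviceToHost, streams[s]);
        }

        cudaDeviceSynchronize();             // wait for all 16 streams to drain

        for (int s = 0; s < NUM_STREAMS; ++s) {
            cudaFreeHost(h_data[s]);
            cudaFree(d_data[s]);
            cudaStreamDestroy(streams[s]);
        }
        return 0;
    }

Whether all 16 kernels actually execute at the same time still depends on the device's concurrent-kernel support and per-kernel resource usage; a single-block kernel occupies at most one SM, which is what makes this layout plausible on a 16-SM part.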

answered Oct 15 '22 by Vyas


While a multi-threaded application can hold multiple CUDA contexts simultaneously on the same GPU, those contexts cannot perform operations concurrently. When active, each context has sole use of the GPU, and must yield before another context (which could include operations with a rendering API or a display manager) can have access to the GPU.
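To illustrate the context-stack mechanics with the driver API (error checking omitted; creating the contexts this way does not make them run concurrently):

    #include <cuda.h>

    int main()
    {
        cuInit(0);
        CUdevice dev;
        cuDeviceGet(&dev, 0);

        CUcontext ctxA, ctxB;
        cuCtxCreate(&ctxA, 0, dev);  // ctxA created and made current on this thread
        cuCtxCreate(&ctxB, 0, dev);  // ctxB pushed on top; both now exist on the GPU

        // Any work issued here goes to ctxB. The GPU time-slices between the
        // two contexts; it never executes work from both simultaneously.

        cuCtxPopCurrent(NULL);       // pop ctxB; ctxA underneath becomes current

        cuCtxDestroy(ctxB);
        cuCtxDestroy(ctxA);
        return 0;
    }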

So in a word: no, this strategy can't work with any current CUDA version or hardware.

answered Oct 15 '22 by talonmies