
Is there a maximum number of streams in CUDA?

Tags:

cuda

People also ask

What are streams in CUDA?

A stream in CUDA is a sequence of operations that execute on the device in the order in which they are issued by the host code. While operations within a stream are guaranteed to execute in the prescribed order, operations in different streams can be interleaved and, when possible, they can even run concurrently.
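For illustration, here is a minimal sketch of two independent streams (the `scale` kernel and the buffer sizes are made up for the example). Operations inside each stream run in issue order, while work in the two streams may overlap on the device:

```cpp
#include <cuda_runtime.h>

__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;
    float *h_a, *h_b, *d_a, *d_b;
    // Pinned host memory is required for truly asynchronous copies.
    cudaMallocHost(&h_a, n * sizeof(float));
    cudaMallocHost(&h_b, n * sizeof(float));
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMalloc(&d_b, n * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Within each stream the copy-kernel-copy chain runs in issue order;
    // the two streams are independent and may overlap on the device.
    cudaMemcpyAsync(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice, s1);
    scale<<<(n + 255) / 256, 256, 0, s1>>>(d_a, n);
    cudaMemcpyAsync(h_a, d_a, n * sizeof(float), cudaMemcpyDeviceToHost, s1);

    cudaMemcpyAsync(d_b, h_b, n * sizeof(float), cudaMemcpyHostToDevice, s2);
    scale<<<(n + 255) / 256, 256, 0, s2>>>(d_b, n);
    cudaMemcpyAsync(h_b, d_b, n * sizeof(float), cudaMemcpyDeviceToHost, s2);

    cudaDeviceSynchronize();

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(d_a); cudaFree(d_b);
    cudaFreeHost(h_a); cudaFreeHost(h_b);
    return 0;
}
```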

Are CUDA kernels asynchronous?

Kernel calls are asynchronous from the point of view of the CPU, so if you call two kernels in succession, the second one is launched without waiting for the first to finish. It only means that control returns to the CPU immediately.
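A small sketch of that behaviour (kernelA and kernelB are placeholder kernels used only to show the launch ordering): both launches return to the host right away, and the host only blocks at cudaDeviceSynchronize():

```cpp
#include <cuda_runtime.h>

// Placeholder kernels; the bodies are not the point.
__global__ void kernelA(int *d) { d[threadIdx.x] += 1; }
__global__ void kernelB(int *d) { d[threadIdx.x] *= 2; }

int main()
{
    int *d_data;
    cudaMalloc(&d_data, 256 * sizeof(int));
    cudaMemset(d_data, 0, 256 * sizeof(int));

    kernelA<<<1, 256>>>(d_data);   // returns to the host immediately
    kernelB<<<1, 256>>>(d_data);   // queued behind kernelA in the default stream

    // The CPU is free to do other work here while the GPU runs the kernels.

    cudaDeviceSynchronize();       // block the host until both kernels finish
    cudaFree(d_data);
    return 0;
}
```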

What is a CUDA event?

CUDA events are synchronization markers that can be used to monitor the device's progress, to accurately measure timing, and to synchronize CUDA streams. The underlying CUDA events are lazily initialized when the event is first recorded or exported to another process.
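As a sketch of the common timing pattern (myKernel and the buffer size are placeholders chosen for the example):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel; the event timing pattern is what matters.
__global__ void myKernel(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * d[i];
}

int main()
{
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);                    // marker enqueued in the default stream
    myKernel<<<(n + 255) / 256, 256>>>(d, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);                // wait until the stop marker is reached

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);    // elapsed GPU time in milliseconds
    printf("kernel took %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return 0;
}
```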


There is no realistic limit to the number of streams you can create (at least thousands). However, there is a limit to the number of streams you can use effectively to achieve concurrency.

On Fermi, the architecture supports 16-way concurrent kernel launches, but there is only a single connection from the host to the GPU. So even if you have 16 CUDA streams, they all eventually get funneled into one hardware queue. This can create false data dependencies and limit the amount of concurrency you can easily get.

With Kepler, the number of connections between the host and the GPU is 32 (instead of one on Fermi). With the new Hyper-Q technology, it is now much easier to keep the GPU busy with concurrent work.
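As a rough sketch of what this looks like in practice (the touch kernel and the stream count of 64 are arbitrary choices for the example): creating far more streams than the hardware can run concurrently is perfectly legal; the runtime just maps them onto the available hardware queues.

```cpp
#include <cuda_runtime.h>

// Placeholder kernel; each stream gets its own slice of the buffer.
__global__ void touch(float *d, int offset)
{
    d[offset + threadIdx.x] += 1.0f;
}

int main()
{
    const int NSTREAMS = 64;                   // arbitrary, well above 16
    float *d_buf;
    cudaMalloc(&d_buf, NSTREAMS * 64 * sizeof(float));
    cudaMemset(d_buf, 0, NSTREAMS * 64 * sizeof(float));

    cudaStream_t streams[NSTREAMS];
    for (int i = 0; i < NSTREAMS; ++i)
        cudaStreamCreate(&streams[i]);         // creating 64 streams works fine

    // How many of these actually overlap depends on the hardware queues
    // (1 on Fermi, up to 32 with Hyper-Q on Kepler and later).
    for (int i = 0; i < NSTREAMS; ++i)
        touch<<<1, 64, 0, streams[i]>>>(d_buf, i * 64);

    cudaDeviceSynchronize();
    for (int i = 0; i < NSTREAMS; ++i)
        cudaStreamDestroy(streams[i]);
    cudaFree(d_buf);
    return 0;
}
```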


I haven't seen a limit in any documentation, but that doesn't mean all streams will execute concurrently, since concurrency is bounded by hard hardware limits (multiprocessors, registers, etc.).


According to this NVIDIA presentation, the maximum is 16 concurrent kernels (on Fermi): http://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf

To clarify, I've successfully created more than 16 streams, but I think the hardware can only support 16 concurrent kernels, so the excess streams add nothing in terms of concurrency.

Kepler is probably different.