Using different streams for CUDA kernels makes concurrent kernel execution possible. Therefore, n kernels on n streams could theoretically run concurrently if they fit onto the hardware, right?
Now I'm facing the following problem: there are not n distinct kernels but n*m, where the m kernels need to be executed in order. For instance, n=2 and m=3 would lead to the following execution scheme with streams:
Stream 1: <<<Kernel 0.1>>> <<<Kernel 1.1>>> <<<Kernel 2.1>>>
Stream 2: <<<Kernel 0.2>>> <<<Kernel 1.2>>> <<<Kernel 2.2>>>
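To make the setup concrete, here is a minimal CUDA C sketch of this scheme, assuming a hypothetical dummyKernel and hard-coded n=2, m=3. It relies only on the guarantee that kernels issued to the same stream execute in issue order, while kernels in different streams may overlap:

__global__ void dummyKernel(int j) { /* stand-in for Kernel j.(s+1) */ }

int main() {
    cudaStream_t stream[2];
    cudaStreamCreate(&stream[0]);
    cudaStreamCreate(&stream[1]);

    // In-stream ordering enforces Kernel 0.s -> 1.s -> 2.s within each stream;
    // across the two streams the hardware is free to overlap execution.
    for (int j = 0; j < 3; ++j) {
        dummyKernel<<<1, 64, 0, stream[0]>>>(j);   // Kernel j.1
        dummyKernel<<<1, 64, 0, stream[1]>>>(j);   // Kernel j.2
    }

    cudaDeviceSynchronize();
    cudaStreamDestroy(stream[0]);
    cudaStreamDestroy(stream[1]);
    return 0;
}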
My naive assumption is that the kernels x.1 and y.2 should execute concurrently (from a theoretical point of view), or at least not consecutively (from a practical point of view). But my measurements show that this is not the case and that consecutive execution is performed (i.e. K0.1, K1.1, K2.1, K0.2, K1.2, K2.2). The kernels themselves are very small, so concurrent execution should not be a problem.
Now my approach would be to implement a kind of dispatching to make sure that the kernels are enqueued into the GPU's scheduler in an interleaved fashion. But when dealing with a large number of streams/kernels, this could do more harm than good.
Alright, coming straight to the point: what would be an appropriate (or at least different) approach to this problem?
Edit: Measurements are done using CUDA events. I've measured the time needed to fully solve the computation, i.e. the GPU has to compute all n*m kernels. The assumption is: with fully concurrent kernel execution, the total execution time is roughly (ideally) 1/n of the time needed to execute all kernels in order, which requires that two or more kernels can be executed concurrently. I'm ensuring this by using only two distinct streams right now.
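For reference, a sketch of that measurement, assuming a hypothetical launchAllKernels() that enqueues all n*m kernels (see the loops below) and streams created with cudaStreamCreate (i.e. blocking streams). With the legacy default-stream semantics of this CUDA generation, an event recorded on stream 0 after all launches completes only once every kernel in every stream has finished:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);   // stream 0 = legacy default stream
launchAllKernels();          // hypothetical: enqueues all n*m kernels
cudaEventRecord(stop, 0);    // implicitly waits for all blocking streams
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("total time for n*m kernels: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);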
I can measure a clear difference in execution times between using the streams as described and dispatching the kernels in an interleaved fashion, i.e.:
Loop: i = 0 to m
    EnqueueKernel(Kernel i.1, Stream 1)
    EnqueueKernel(Kernel i.2, Stream 2)
versus
Loop: i = 1 to n
    Loop: j = 0 to m
        EnqueueKernel(Kernel j.i, Stream i)
The latter leads to a longer runtime.
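In CUDA C, the two variants would look roughly like this (a sketch, assuming a generic kernel k, launch dimensions grid and block, and an array stream[] of created streams; indices are 0-based here, unlike the 1-based stream numbers above):

// Variant 1: breadth-first / interleaved enqueue (the faster one)
for (int i = 0; i < m; ++i) {
    k<<<grid, block, 0, stream[0]>>>(i);   // Kernel i.1
    k<<<grid, block, 0, stream[1]>>>(i);   // Kernel i.2
}

// Variant 2: depth-first, one stream after the other (the slower one)
for (int i = 0; i < n; ++i)
    for (int j = 0; j < m; ++j)
        k<<<grid, block, 0, stream[i]>>>(j);   // Kernel j.(i+1)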
Edit #2: Changed the stream numbers to begin at 1 (instead of 0, see comments below).
Edit #3: Hardware is an NVIDIA Tesla M2090 (i.e. Fermi, compute capability 2.0).
On Fermi (aka Compute Capability 2.0) hardware it is best to interleave kernel launches to multiple streams rather than to launch all kernels to one stream, then the next stream, etc. This is because the hardware can immediately launch kernels to different streams if there are sufficient resources, whereas if subsequent launches are to the same stream there is often delay introduced, reducing concurrency. This is the reason that your first approach performs better, and this approach is the one you should choose.
Enabling profiling can also disable concurrency on Fermi, so be careful with that. Also, be careful about using CUDA events during your launch loop, as these can interfere -- best to time the whole loop using events as you are doing, for example.