 

CUDA warps and occupancy

Tags:

cuda

I have always thought that the warp scheduler will execute one warp at a time, depending on which warp is ready, and that this warp can be from any one of the thread blocks in the multiprocessor. However, in one of the Nvidia webinar slides, it is stated that "Occupancy = Number of warps running concurrently on a multiprocessor divided by maximum number of warps that can run concurrently". So more than one warp can run at one time? How does this work?

Thank you.

Rayne asked Apr 19 '11


People also ask

What is CUDA occupancy?

The CUDA Occupancy Calculator allows you to compute the multiprocessor occupancy of a GPU by a given CUDA kernel. The multiprocessor occupancy is the ratio of active warps to the maximum number of warps supported on a multiprocessor of the GPU.
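The same ratio can also be queried at runtime through the CUDA occupancy API. Below is a minimal sketch; the kernel and the block size of 256 are illustrative assumptions, not anything prescribed by the API:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Placeholder kernel; the occupancy query only inspects its resource usage.
    __global__ void dummyKernel(float *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = 2.0f * i;
    }

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        int blockSize = 256;  // threads per block (assumed for this sketch)
        int maxActiveBlocks = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &maxActiveBlocks, dummyKernel, blockSize, 0 /* dynamic shared mem */);

        // Occupancy = active warps / maximum warps per multiprocessor
        int activeWarps = maxActiveBlocks * blockSize / prop.warpSize;
        int maxWarps    = prop.maxThreadsPerMultiProcessor / prop.warpSize;
        printf("Occupancy: %d/%d warps (%.0f%%)\n",
               activeWarps, maxWarps, 100.0 * activeWarps / maxWarps);
        return 0;
    }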

What is a CUDA warp?

In CUDA, groups of threads with consecutive thread indexes are bundled into warps. At runtime, a thread block is divided into a number of warps for execution on an SM; the threads of a warp execute in lockstep, each on one of the SM's CUDA cores. The warp size is a hardware property (exposed as the built-in warpSize), and it has been 32 on all NVIDIA GPUs to date.
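As a sketch of how thread indexes map onto warps, a thread's warp and lane within its block can be derived from the built-in warpSize:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void warpInfo()
    {
        // Consecutive thread indexes fall into the same warp:
        int lane   = threadIdx.x % warpSize;  // position within the warp
        int warpId = threadIdx.x / warpSize;  // warp within the block
        if (lane == 0)
            printf("block %d: warp %d starts at thread %d\n",
                   blockIdx.x, warpId, threadIdx.x);
    }

    int main()
    {
        warpInfo<<<2, 128>>>();   // 2 blocks of 128 threads = 4 warps each
        cudaDeviceSynchronize();
        return 0;
    }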

How do I increase my GPU occupancy?

Occupancy can be increased by increasing block size. For example, on a GPU that supports 16 active blocks and 64 active warps per SM, blocks with 32 threads (1 warp per block) result in at most 16 active warps (25% theoretical occupancy), because only 16 blocks can be active, and each block has only one warp.
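Here is the arithmetic from that example worked through for a few block sizes, using the same hypothetical limits of 16 active blocks and 64 active warps per SM:

    #include <algorithm>
    #include <cstdio>

    int main()
    {
        // Hypothetical SM limits from the example above.
        const int maxBlocksPerSM = 16, maxWarpsPerSM = 64, warpSize = 32;

        for (int blockSize : {32, 64, 128, 256}) {
            int warpsPerBlock = blockSize / warpSize;
            int activeWarps   = std::min(maxBlocksPerSM * warpsPerBlock, maxWarpsPerSM);
            printf("block size %3d -> %2d active warps (%3.0f%% occupancy)\n",
                   blockSize, activeWarps, 100.0 * activeWarps / maxWarpsPerSM);
        }
        return 0;
    }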

How many warps are there in SM?

It depends on the resource limits of the SM. For example, on a GPU with 32,768 registers per SM and a limit of 64 registers per thread: if each thread uses the maximum number of registers (thereby minimizing global memory accesses), at most 32768 registers / 64 registers per thread = 512 threads can run per SM simultaneously, i.e. 16 warps per SM.
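Register pressure can also be traded against occupancy explicitly. As a hedged sketch, the __launch_bounds__ qualifier (or the nvcc flag -maxrregcount=N) asks the compiler to cap per-thread register use so that more warps fit on the SM; the kernel name and the numbers below are illustrative:

    // Ask the compiler to keep register use low enough that at least
    // 2 blocks of 256 threads can be resident per SM (values illustrative).
    __global__ void __launch_bounds__(256, 2)
    boundedKernel(float *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] *= 2.0f;
    }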


1 Answer

"Running" might be better interpreted as "having state on the SM and/or instructions in the pipeline". The GPU hardware schedules up as many blocks as are available or will fit into the resources of the SM (whichever is smaller), allocates state for every warp they contain (ie. register file and local memory), then starts scheduling the warps for execution. The instruction pipeline seems to be about 21-24 cycles long, and so there are a lot of threads in various stages of "running" at any given time.

The first two generations of CUDA-capable GPUs (G80/G90 and GT200) only retire instructions from a single warp every four clock cycles. Compute 2.0 devices dual-issue instructions from two warps every two clock cycles, so there are two warps retiring instructions per clock. Compute 2.1 extends this by allowing what is effectively out-of-order execution: still only two warps per clock, but potentially two instructions from the same warp at a time. So the extra 16 cores per SM get used for instruction-level parallelism, still issued from the same shared scheduler.

talonmies answered Oct 09 '22