 

CUDA warps and occupancy

Tags:

cuda

I have always thought that the warp scheduler will execute one warp at a time, depending on which warp is ready, and that this warp can be from any one of the thread blocks in the multiprocessor. However, in one of the Nvidia webinar slides, it is stated that "Occupancy = Number of warps running concurrently on a multiprocessor divided by maximum number of warps that can run concurrently". So more than one warp can run at one time? How does this work?

Thank you.

Rayne asked Apr 19 '11


People also ask

What is CUDA occupancy?

The CUDA Occupancy Calculator allows you to compute the multiprocessor occupancy of a GPU by a given CUDA kernel. The multiprocessor occupancy is the ratio of active warps to the maximum number of warps supported on a multiprocessor of the GPU.
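The same ratio can also be queried at runtime through the CUDA occupancy API. Below is a minimal sketch; the kernel and the block size of 256 are illustrative assumptions, not anything prescribed by the API:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Placeholder kernel; the occupancy query only inspects its resource usage.
    __global__ void dummyKernel(float *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = 2.0f * i;
    }

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        int blockSize = 256;  // threads per block (assumed for this sketch)
        int maxActiveBlocks = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &maxActiveBlocks, dummyKernel, blockSize, 0 /* dynamic shared mem */);

        // Occupancy = active warps / maximum warps per multiprocessor
        int activeWarps = maxActiveBlocks * blockSize / prop.warpSize;
        int maxWarps    = prop.maxThreadsPerMultiProcessor / prop.warpSize;
        printf("Occupancy: %d/%d warps (%.0f%%)\n",
               activeWarps, maxWarps, 100.0 * activeWarps / maxWarps);
        return 0;
    }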

What is a CUDA warp?

In CUDA, groups of threads with consecutive thread indexes are bundled into warps. At runtime, a thread block is divided into a number of warps for execution on an SM; the threads of a warp execute in lockstep, each on one of the SM's CUDA cores. The warp size is a hardware property (exposed as the built-in warpSize), and it has been 32 on all NVIDIA GPUs to date.
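As a sketch of how thread indexes map onto warps, a thread's warp and lane within its block can be derived from the built-in warpSize:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void warpInfo()
    {
        // Consecutive thread indexes fall into the same warp:
        int lane   = threadIdx.x % warpSize;  // position within the warp
        int warpId = threadIdx.x / warpSize;  // warp within the block
        if (lane == 0)
            printf("block %d: warp %d starts at thread %d\n",
                   blockIdx.x, warpId, threadIdx.x);
    }

    int main()
    {
        warpInfo<<<2, 128>>>();   // 2 blocks of 128 threads = 4 warps each
        cudaDeviceSynchronize();
        return 0;
    }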

How do I increase my GPU occupancy?

Occupancy can be increased by increasing block size. For example, on a GPU that supports 16 active blocks and 64 active warps per SM, blocks with 32 threads (1 warp per block) result in at most 16 active warps (25% theoretical occupancy), because only 16 blocks can be active, and each block has only one warp.
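Here is the arithmetic from that example worked through for a few block sizes, using the same hypothetical limits of 16 active blocks and 64 active warps per SM:

    #include <algorithm>
    #include <cstdio>

    int main()
    {
        // Hypothetical SM limits from the example above.
        const int maxBlocksPerSM = 16, maxWarpsPerSM = 64, warpSize = 32;

        for (int blockSize : {32, 64, 128, 256}) {
            int warpsPerBlock = blockSize / warpSize;
            int activeWarps   = std::min(maxBlocksPerSM * warpsPerBlock, maxWarpsPerSM);
            printf("block size %3d -> %2d active warps (%3.0f%% occupancy)\n",
                   blockSize, activeWarps, 100.0 * activeWarps / maxWarpsPerSM);
        }
        return 0;
    }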

How many warps are there in SM?

It depends on the resource limits of the SM. For example, on a GPU with 32,768 registers per SM and a limit of 64 registers per thread: if each thread uses the maximum number of registers (thereby minimizing global memory accesses), at most 32768 registers / 64 registers per thread = 512 threads can run per SM simultaneously, i.e. 16 warps per SM.
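Register pressure can also be traded against occupancy explicitly. As a hedged sketch, the __launch_bounds__ qualifier (or the nvcc flag -maxrregcount=N) asks the compiler to cap per-thread register use so that more warps fit on the SM; the kernel name and the numbers below are illustrative:

    // Ask the compiler to keep register use low enough that at least
    // 2 blocks of 256 threads can be resident per SM (values illustrative).
    __global__ void __launch_bounds__(256, 2)
    boundedKernel(float *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] *= 2.0f;
    }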


1 Answer

"Running" might be better interpreted as "having state on the SM and/or instructions in the pipeline". The GPU hardware schedules up as many blocks as are available or will fit into the resources of the SM (whichever is smaller), allocates state for every warp they contain (ie. register file and local memory), then starts scheduling the warps for execution. The instruction pipeline seems to be about 21-24 cycles long, and so there are a lot of threads in various stages of "running" at any given time.

The first two generations of CUDA-capable GPUs (G80/G90 and GT200) only retire instructions from a single warp every four clock cycles. Compute 2.0 devices dual-issue instructions from two warps every two clock cycles, so there are two warps retiring instructions per clock. Compute 2.1 extends this by allowing what is effectively out-of-order execution: still only two warps per clock, but potentially two instructions from the same warp at a time. So the extra 16 cores per SM get used for instruction-level parallelism, still issued from the same shared scheduler.

talonmies answered Oct 09 '22