 

Why bother to know about CUDA Warps?

Tags:

cuda

gpu

gpu-warp

I have a GeForce GTX460 SE, so it has 6 SMs x 48 CUDA cores = 288 CUDA cores. It is known that one warp contains 32 threads, and that only one warp per block can be executed at a time. Does that mean that a single multiprocessor (SM) can simultaneously execute only one block, one warp, and therefore only 32 threads, even though 48 cores are available?

In addition, to distribute work to a specific thread and block one can use threadIdx.x and blockIdx.x, and to allocate them one uses kernel <<< Blocks, Threads >>> (). But how do you allocate a specific number of warps and distribute them? And if that is not possible, why bother to know about warps at all?
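For reference, here is a minimal sketch (the kernel name and launch sizes are made up for illustration) of how threadIdx.x and blockIdx.x are used in a launch, and of the fact that warps are never allocated explicitly: each block is simply split by the hardware into consecutive groups of warpSize (32) threads.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Illustrative kernel: each thread derives its global index and, implicitly,
    // the warp it belongs to. Warps are not specified in the launch configuration.
    __global__ void showIndices()
    {
        int globalId    = blockIdx.x * blockDim.x + threadIdx.x;
        int warpInBlock = threadIdx.x / warpSize;   // warpSize is a built-in device constant (32)
        int laneInWarp  = threadIdx.x % warpSize;   // position of this thread inside its warp

        if (laneInWarp == 0)                        // print once per warp to keep output short
            printf("block %d, warp %d, first global id %d\n",
                   blockIdx.x, warpInBlock, globalId);
    }

    int main()
    {
        // <<< Blocks, Threads >>> : only blocks and threads are specified;
        // each block of 128 threads is divided into 4 warps by the hardware.
        showIndices<<<2, 128>>>();
        cudaDeviceSynchronize();
        return 0;
    }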

Asked Aug 05 '12 by Alex

People also ask

What are CUDA warps?

In CUDA, groups of threads with consecutive thread indexes are bundled into warps; a full warp is executed together on a single streaming multiprocessor (SM). At runtime, a thread block is divided into a number of warps for execution on the cores of an SM. The size of a warp depends on the hardware.

What is the relationship between warps thread blocks and CUDA cores?

All the threads in a warp execute concurrently on the resources of the SM. The actual execution of a thread is performed by the CUDA cores contained in the SM. There is no specific mapping between threads and cores. If a warp contains 20 active threads but only 16 cores are currently available, the warp will not run.

How many threads commonly reside in a warp?

A warp is a set of 32 threads within a thread block such that all the threads in a warp execute the same instruction.

What is a warp in parallel programming?

Warps are the base unit used to schedule both computation on Arithmetic and Logic Units (ALUs) and memory accesses. Threads within the same warp follow the SIMD pattern, i.e. they are supposed to execute the same operation at a given clock cycle...


1 Answer

(Image: Overview of a GTX460 SM)

The situation is quite a bit more complicated than what you describe.

The ALUs (cores), load/store (LD/ST) units and Special Function Units (SFU) (green in the image) are pipelined units. They hold many computations or operations in flight at the same time, in various stages of completion. So, in one cycle they can accept a new operation and provide the results of another operation that was started long ago (around 20 cycles for the ALUs, if I remember correctly). So, a single SM in theory has resources for processing 48 * 20 cycles = 960 ALU operations at the same time, which is 960 / 32 threads per warp = 30 warps. In addition, it can process LD/ST operations and SFU operations at whatever their latency and throughput are.
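As a side note, the relevant per-SM limits can be queried at runtime with the standard CUDA runtime API. The host-only sketch below (using device 0 is an assumption) prints the warp size and the maximum number of warps that can be resident on one SM, which is the hardware bound on how much latency hiding is available; the pipeline arithmetic above concerns operations in flight, which is a related but separate figure.

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // query device 0

        printf("SM count:                  %d\n", prop.multiProcessorCount);
        printf("Warp size:                 %d\n", prop.warpSize);
        printf("Max threads per SM:        %d\n", prop.maxThreadsPerMultiProcessor);

        // Upper bound on warps resident on one SM at the same time.
        printf("Max resident warps per SM: %d\n",
               prop.maxThreadsPerMultiProcessor / prop.warpSize);
        return 0;
    }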

The warp schedulers (yellow in the image) can schedule 2 warps * 32 threads per warp = 64 threads to the pipelines per cycle. So that's the number of results that can be obtained per clock. Given that the computing resources are a mix of 48 cores, 16 LD/ST units, and 8 SFUs, each with different latencies, a mix of warps is being processed at the same time. At any given cycle, the warp schedulers try to "pair up" two warps to schedule, to maximize the utilization of the SM.

The warp schedulers can issue warps either from different blocks, or from different places in the same block, if the instructions are independent. So, warps from multiple blocks can be processed at the same time.

Adding to the complexity, warps that are executing instructions for which there are fewer than 32 execution resources must be issued multiple times for all the threads to be serviced. For instance, there are 8 SFUs, so a warp containing an instruction that requires the SFUs must be scheduled 4 times.

This description is simplified. There are other restrictions that come into play as well that determine how the GPU schedules the work. You can find more information by searching the web for "fermi architecture".

So, coming to your actual question,

why bother to know about Warps?

Knowing the number of threads in a warp and taking it into consideration becomes important when you try to maximize the performance of your algorithm. If you don't follow these rules, you lose performance:

  • In the kernel invocation, <<<Blocks, Threads>>>, try to choose a number of threads per block that is a multiple of the number of threads in a warp (32). If you don't, you end up launching blocks that contain inactive threads (see the sketch after this list).

  • In your kernel, try to have each thread in a warp follow the same code path. If you don't, you get what's called warp divergence. This happens because the GPU has to run the entire warp through each of the divergent code paths.

  • In your kernel, try to have each thread in a warp load and store data in specific patterns. For instance, have the threads in a warp access consecutive 32-bit words in global memory.
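To make the three points above concrete, here is a minimal sketch (the kernel, the array, and the scaling factors are made up for illustration): the block size is a multiple of the warp size, the branch condition is uniform within each warp, and neighbouring threads access neighbouring 32-bit words.

    #include <cuda_runtime.h>

    // Illustrative kernel: consecutive threads read/write consecutive floats
    // (coalesced access), and the branch depends on blockIdx.x, so every
    // thread in a warp takes the same path (no warp divergence).
    __global__ void scaleHalves(float *data, int n, float a, float b)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            // Branch on something uniform across the warp, rather than
            // e.g. (i % 2), which would split every warp in two.
            float factor = (blockIdx.x % 2 == 0) ? a : b;
            data[i] *= factor;   // thread i touches word i: coalesced
        }
    }

    int main()
    {
        const int n = 10000;
        float *d_data;
        cudaMalloc(&d_data, n * sizeof(float));

        // Pick a block size that is a multiple of the warp size (32).
        const int threadsPerBlock = 256;
        const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;

        scaleHalves<<<blocks, threadsPerBlock>>>(d_data, n, 2.0f, 0.5f);
        cudaDeviceSynchronize();
        cudaFree(d_data);
        return 0;
    }

Branching on blockIdx.x rather than on a per-thread value keeps all 32 threads of a warp on the same code path, which is the design choice that avoids divergence here.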

Answered Sep 23 '22 by Roger Dahl