I understand that in CUDA, 32 adjacent threads in the same block are scheduled as a warp. But I frequently see tutorial CUDA code that uses multiple blocks with 1 thread per block. In this model, will 32 threads from 32 different blocks be scheduled as a warp? If not, can I say this model is not as efficient as organizing threads into blocks of 32? Thanks!
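For illustration, the launch pattern I mean looks roughly like this (the kernel name add_one is made up, and d_data / N are assumed to have been allocated and defined earlier):

    __global__ void add_one(float *data)      // hypothetical example kernel
    {
        // with one thread per block, blockIdx.x alone picks the element
        data[blockIdx.x] += 1.0f;
    }

    // tutorial-style launch: N blocks, 1 thread per block
    add_one<<<N, 1>>>(d_data);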
A warp is a set of 32 threads within a thread block such that all the threads in a warp execute the same instruction. These threads are selected serially by the SM. Once a thread block is launched on a multiprocessor (SM), all of its warps are resident until their execution finishes.
Therefore, blocks are divided into warps of 32 threads for execution.
If threads in the same warp follow different paths of control flow, we say that these threads diverge in their execution. When a warp diverges, it executes each branch path in a separate pass; these passes are sequential to each other and thus increase the execution time.
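A minimal sketch of divergence within a warp (the kernel and the branch condition are purely illustrative, not from the original post): even and odd lanes take different branches, so the hardware runs the two paths in separate, serialized passes.

    __global__ void diverge_example(int *out)   // illustrative kernel
    {
        int tid = threadIdx.x;

        // threads in the same warp take different branches here,
        // so the two paths execute one after the other
        if (tid % 2 == 0)
            out[tid] = tid * 2;   // pass 1: even lanes active
        else
            out[tid] = tid * 3;   // pass 2: odd lanes active
    }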
Threads are fundamentally executed in warps of 32 threads. Blocks are composed of 1 or more warps, and a grid is composed of 1 or more blocks.
No, threads from different blocks cannot be scheduled in the same warp. If you create a grid of threadblocks with only a single thread each, you're definitely not getting the full performance from the machine; it's less efficient than having 32 (or an integer multiple of 32) threads per block. A Fermi SM, for example, has 32 warp lanes that can be in use. If you are scheduling blocks of a single thread, then only 1 of those 32 lanes is in use at any given time.
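A sketch of the difference (the kernel name scale and the sizes are made up for illustration): the same work launched as N blocks of 1 thread versus blocks that are a multiple of 32 threads. The kernel is identical; only the warp occupancy differs.

    #include <cuda_runtime.h>

    __global__ void scale(float *data, int n)   // hypothetical kernel
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // works for both launches below
        if (i < n)
            data[i] *= 2.0f;
    }

    int main()
    {
        const int N = 4096;
        float *d_data;
        cudaMalloc(&d_data, N * sizeof(float));

        // inefficient: every warp carries only 1 active lane out of 32
        scale<<<N, 1>>>(d_data, N);

        // better: 256 threads per block = 8 full warps per block
        // (both launches run here only to show the two configurations)
        scale<<<(N + 255) / 256, 256>>>(d_data, N);

        cudaDeviceSynchronize();
        cudaFree(d_data);
        return 0;
    }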
Threads have a thread ID (the threadIdx built-in variable), which is defined within (and unique only to) a single block.
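Because threadIdx repeats across blocks, a globally unique index is usually derived from both the block and thread indices. A common 1D pattern (the kernel below is just an illustration) is:

    __global__ void show_ids(int *local_ids, int *global_ids)   // illustrative kernel
    {
        int local_id  = threadIdx.x;                             // unique only within a block
        int global_id = blockIdx.x * blockDim.x + threadIdx.x;   // unique across the whole grid (1D case)

        local_ids[global_id]  = local_id;
        global_ids[global_id] = global_id;
    }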
The Hardware Multithreading section of the CUDA C Programming Guide gives a formula for the total number of warps in a single block.
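That formula is ceil(T / Wsize), where T is the number of threads per block and Wsize is the warp size (32 on current hardware). In integer arithmetic it is commonly computed as below (variable and function names are illustrative):

    // warps per block = ceil(T / Wsize)
    int warps_per_block(int threads_per_block, int warp_size)   // warp_size is 32 on current GPUs
    {
        return (threads_per_block + warp_size - 1) / warp_size;
    }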