I have two questions regarding __syncwarp() in CUDA:
__syncwarp() do, and why is it necessary?__syncthreads() may be useful, but since the warps the block is split into may not match with the groups, how would one guarantee correctness when using __syncwarp()?If I understand correctly, a warp in CUDA is executed in an SIMD fasion. Does that not imply that all threads in a warp are always synchronized?
No. There can be warp level execution divergence (usually branching, but can be other things like warp shuffles, voting, and predicated execution), handled by instruction replay or execution masking. Note that in "modern" CUDA, implicit warp synchronous programming is no longer safe, thus warp level synchronization is not just desirable, it is mandatory.
If so, what exactly does __syncwarp() do, and why is it necessary?
Because there can be warp level execution divergence, and this is how synchronization within a divergent warp is achieved.
Say we have a kernel launched with a block size of 1024, where the threads within a block are divided into groups of 32 threads each. Each thread communicates with other threads in it's group via shared memory, but does not communicate with any thread outside it's group. In such a kernel, I can see how a more granular synchronization than __syncthreads() may be useful, but since the warps the block is split into may not match with the groups, how would one guarantee correctness when using __syncwarp()?
By ensuring that the split is always performed explicitly using calculated warp boundaries (or a suitable thread mask).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With