If it is absolutely required that all the threads in a block be at the same point in the code, do we still need the __syncthreads function when the number of threads being launched equals the number of threads in a warp?
Note: No extra threads or blocks, just a single warp for the kernel.
Example code:
__shared__ volatile int sdata[16];
int tid = threadIdx.x;  // assumed thread index
int index = some_number_between_0_and_15;
sdata[tid] = some_number;
output[tid] = x ^ y ^ z ^ sdata[index];
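For context, here is a minimal, self-contained sketch of that setup, assuming a launch of one block of 32 threads; the kernel name, the stored values, and the index calculation are placeholders rather than anything from the question:

#include <cuda_runtime.h>

// Sketch only: each thread writes one element of shared memory and then
// reads an element that may have been written by a different thread.
__global__ void singleWarpKernel(int *output)
{
    __shared__ int sdata[32];          // one slot per thread of the single warp

    int tid   = threadIdx.x;           // 0..31, the whole block is one warp
    int index = (tid + 1) % 32;        // analogous to some_number_between_0_and_15

    sdata[tid] = tid * 2;              // analogous to some_number

    // The question: is a __syncthreads() needed here, before the read below,
    // given that every thread launched belongs to the same warp?

    output[tid] = sdata[index];
}

int main()
{
    int *d_out;
    cudaMalloc(&d_out, 32 * sizeof(int));
    singleWarpKernel<<<1, 32>>>(d_out);    // one block, exactly one warp
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}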
Synchronization between Threads

The CUDA API has a method, __syncthreads(), to synchronize threads. When the method is encountered in the kernel, all threads in a block will be blocked at the calling location until each of them reaches that location.
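As an illustration (not taken from the question), a hedged sketch of the usual pattern where __syncthreads() separates the shared-memory writes from the reads; the kernel name, block size, and data are made up:

// Each thread stages one element in shared memory; the barrier guarantees
// that every write has completed before any thread reads another thread's slot.
__global__ void reverseKernel(int *output, const int *input)
{
    __shared__ int sdata[128];

    int tid = threadIdx.x;

    sdata[tid] = input[tid];                     // write phase

    __syncthreads();                             // all 128 threads wait here

    output[tid] = sdata[blockDim.x - 1 - tid];   // read phase, touching other threads' data
}

Launched as reverseKernel<<<1, 128>>>(d_out, d_in), the block contains four warps, so the barrier is genuinely needed.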
NVIDIA GPUs execute warps of 32 parallel threads using SIMT, which enables each thread to access its own registers, to load and store from divergent addresses, and to follow divergent control flow paths.
In CUDA, groups of threads with consecutive thread indexes are bundled into warps. At runtime, a thread block is divided into a number of warps for execution on a single streaming multiprocessor (SM), whose scheduler issues each warp's instructions to its CUDA cores. The size of a warp depends on the hardware; on all current NVIDIA GPUs it is 32.
A warp is a set of 32 threads within a thread block such that all the threads in a warp execute the same instruction.
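To make the grouping concrete, a small sketch (assuming a 1D block; the kernel and variable names are illustrative) of how a thread's warp and its lane within that warp are usually derived:

__global__ void warpInfoKernel(int *warpIds, int *laneIds)
{
    int tid  = threadIdx.x;        // linear thread index within the block
    int warp = tid / warpSize;     // which warp of the block this thread is in
    int lane = tid % warpSize;     // position of the thread inside its warp

    warpIds[tid] = warp;
    laneIds[tid] = lane;
}

Threads with indexes 0..31 land in warp 0, 32..63 in warp 1, and so on.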
EDIT: Updated with more information about using volatile.
Presumably you want all threads to be at the same point because they are reading data written by other threads into shared memory. If you are launching a single warp (in each block), then you know that all of its threads are executing together, so on the face of it you can omit the __syncthreads(), a practice known as "warp-synchronous programming". However, there are a few things to look out for:
- Remember that the compiler is free to keep a value in a register and delay the store to shared memory, as long as single-thread semantics are preserved. __syncthreads() acts as a barrier to this and therefore ensures that the data is written to shared memory before other threads read the data. Declaring the shared memory volatile causes the compiler to perform the memory write rather than keep the value in registers, however this has some risks and is more of a hack (meaning I don't know how this will be affected in the future); the reduction sketch below shows a volatile pointer used this way.
- Strictly speaking, you should use __syncthreads() to conform with the CUDA Programming Model.
- Rather than hard-coding the warp size, use the built-in variable warpSize in device code (documented in the CUDA Programming Guide, under "built-in variables", section B.4 in the 4.1 version).

Note that some of the SDK samples (notably reduction and scan) use this warp-synchronous technique.
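For reference, a hedged sketch in the spirit of those samples (not a copy of the SDK code): the final steps of a shared-memory reduction are performed without __syncthreads(), relying on a volatile pointer and on the fact that the remaining active threads form a single warp; the kernel names and the block size of 256 are assumptions:

// Warp-synchronous tail of a block reduction, in the style of the SDK
// "reduction" sample. Called below with 64 partial sums left in sdata[].
__device__ void warpReduce(volatile int *sdata, int tid)
{
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid +  8];
    sdata[tid] += sdata[tid +  4];
    sdata[tid] += sdata[tid +  2];
    sdata[tid] += sdata[tid +  1];
}

__global__ void reduceKernel(int *output, const int *input, int n)
{
    __shared__ int sdata[256];      // assumes blockDim.x == 256

    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + threadIdx.x;

    sdata[tid] = (i < n) ? input[i] : 0;
    __syncthreads();

    // While more than one warp is still cooperating, barriers are required.
    for (int s = blockDim.x / 2; s > 32; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // Final warp: no __syncthreads(), but the pointer must be volatile.
    if (tid < 32)
        warpReduce(sdata, tid);

    if (tid == 0)
        output[blockIdx.x] = sdata[0];
}

Note that this relies on implicit warp-synchronous execution; on newer architectures with independent thread scheduling that guarantee no longer holds, which is exactly the kind of future change the caveat about volatile hints at.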