If it is absolutely required that all the threads in a block be at the same point in the code, do we still need the __syncthreads function when the number of threads being launched equals the number of threads in a warp?
Note: No extra threads or blocks, just a single warp for the kernel.
Example code:
__shared__ volatile int sdata[16];
int tid = threadIdx.x;  // assumed thread index
int index = some_number_between_0_and_15;
sdata[tid] = some_number;
output[tid] = x ^ y ^ z ^ sdata[index];
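For context, here is a minimal, self-contained sketch of that setup, assuming a launch of one block of 32 threads; the kernel name, the stored values, and the index calculation are placeholders rather than anything from the question:

#include <cuda_runtime.h>

// Sketch only: each thread writes one element of shared memory and then
// reads an element that may have been written by a different thread.
__global__ void singleWarpKernel(int *output)
{
    __shared__ int sdata[32];          // one slot per thread of the single warp

    int tid   = threadIdx.x;           // 0..31, the whole block is one warp
    int index = (tid + 1) % 32;        // analogous to some_number_between_0_and_15

    sdata[tid] = tid * 2;              // analogous to some_number

    // The question: is a __syncthreads() needed here, before the read below,
    // given that every thread launched belongs to the same warp?

    output[tid] = sdata[index];
}

int main()
{
    int *d_out;
    cudaMalloc(&d_out, 32 * sizeof(int));
    singleWarpKernel<<<1, 32>>>(d_out);    // one block, exactly one warp
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}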
Synchronization between Threads

The CUDA API has a method, __syncthreads(), to synchronize threads. When the method is encountered in the kernel, all threads in a block will be blocked at the calling location until each of them reaches that location.
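As an illustration (not taken from the question), a hedged sketch of the usual pattern where __syncthreads() separates the shared-memory writes from the reads; the kernel name, block size, and data are made up:

// Each thread stages one element in shared memory; the barrier guarantees
// that every write has completed before any thread reads another thread's slot.
__global__ void reverseKernel(int *output, const int *input)
{
    __shared__ int sdata[128];

    int tid = threadIdx.x;

    sdata[tid] = input[tid];                     // write phase

    __syncthreads();                             // all 128 threads wait here

    output[tid] = sdata[blockDim.x - 1 - tid];   // read phase, touching other threads' data
}

Launched as reverseKernel<<<1, 128>>>(d_out, d_in), the block contains four warps, so the barrier is genuinely needed.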
NVIDIA GPUs execute warps of 32 parallel threads using SIMT, which enables each thread to access its own registers, to load and store from divergent addresses, and to follow divergent control flow paths.
In CUDA, groups of threads with consecutive thread indexes are bundled into warps. At runtime, a thread block is divided into a number of warps for execution on a single streaming multiprocessor (SM), whose scheduler issues each warp's instructions to its CUDA cores. The size of a warp depends on the hardware; on all current NVIDIA GPUs it is 32.
A warp is a set of 32 threads within a thread block such that all the threads in a warp execute the same instruction.
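To make the grouping concrete, a small sketch (assuming a 1D block; the kernel and variable names are illustrative) of how a thread's warp and its lane within that warp are usually derived:

__global__ void warpInfoKernel(int *warpIds, int *laneIds)
{
    int tid  = threadIdx.x;        // linear thread index within the block
    int warp = tid / warpSize;     // which warp of the block this thread is in
    int lane = tid % warpSize;     // position of the thread inside its warp

    warpIds[tid] = warp;
    laneIds[tid] = lane;
}

Threads with indexes 0..31 land in warp 0, 32..63 in warp 1, and so on.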
EDIT: Updated with more information about using volatile.
Presumably you want all threads to be at the same point because they are reading data written by other threads into shared memory. If you are launching a single warp (in each block), then you know that all of its threads are executing together, so on the face of it you can omit the __syncthreads(), a practice known as "warp-synchronous programming". However, there are a few things to look out for:
- Remember that the compiler is free to keep a value in a register and delay the store to shared memory, as long as single-thread semantics are preserved. __syncthreads() acts as a barrier to this and therefore ensures that the data is written to shared memory before other threads read the data. Declaring the shared memory volatile causes the compiler to perform the memory write rather than keep the value in registers, however this has some risks and is more of a hack (meaning I don't know how this will be affected in the future); the reduction sketch below shows a volatile pointer used this way.
- Strictly speaking, you should use __syncthreads() to conform with the CUDA Programming Model.
- Rather than hard-coding the warp size, use the built-in variable warpSize in device code (documented in the CUDA Programming Guide, under "built-in variables", section B.4 in the 4.1 version).

Note that some of the SDK samples (notably reduction and scan) use this warp-synchronous technique.
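For reference, a hedged sketch in the spirit of those samples (not a copy of the SDK code): the final steps of a shared-memory reduction are performed without __syncthreads(), relying on a volatile pointer and on the fact that the remaining active threads form a single warp; the kernel names and the block size of 256 are assumptions:

// Warp-synchronous tail of a block reduction, in the style of the SDK
// "reduction" sample. Called below with 64 partial sums left in sdata[].
__device__ void warpReduce(volatile int *sdata, int tid)
{
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid +  8];
    sdata[tid] += sdata[tid +  4];
    sdata[tid] += sdata[tid +  2];
    sdata[tid] += sdata[tid +  1];
}

__global__ void reduceKernel(int *output, const int *input, int n)
{
    __shared__ int sdata[256];      // assumes blockDim.x == 256

    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + threadIdx.x;

    sdata[tid] = (i < n) ? input[i] : 0;
    __syncthreads();

    // While more than one warp is still cooperating, barriers are required.
    for (int s = blockDim.x / 2; s > 32; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // Final warp: no __syncthreads(), but the pointer must be volatile.
    if (tid < 32)
        warpReduce(sdata, tid);

    if (tid == 0)
        output[blockIdx.x] = sdata[0];
}

Note that this relies on implicit warp-synchronous execution; on newer architectures with independent thread scheduling that guarantee no longer holds, which is exactly the kind of future change the caveat about volatile hints at.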