 

CUDA: How to unroll the first 32 threads so they are executed in parallel?

Tags: c++, cuda, gpu

I know that "each warp contains threads of consecutive, increasing thread IDs with the first warp containing thread 0", so the first 32 threads should be in the first warp. I also know that all threads in one warp execute simultaneously on whichever streaming multiprocessor they are scheduled on.

As I understand it, this means there is no need for thread synchronization while only one warp is still doing work. But the code below produces a wrong answer if I remove any of the __syncthreads() calls in the penultimate if block. I have tried to find the cause but ended up with nothing. I would really appreciate your help: what is wrong with this code? Why can't I keep only the last __syncthreads() and still get the right answer?

#define BLOCK_SIZE 128

__global__ void reduce ( int * inData, int * outData )
{
 __shared__ int data [BLOCK_SIZE]; 
 int tid = threadIdx.x; 
 int i   = blockIdx.x * blockDim.x + threadIdx.x; 

 // first reduction step: each thread adds two elements while loading
 data [tid] = inData [i] + inData [i + blockDim.x / 2 ];
 __syncthreads ();

 // tree reduction in shared memory down to the last 64 partial sums
 for ( int s = blockDim.x / 4; s > 32; s >>= 1 ) 
 {
  if ( tid < s ) 
   data [tid] += data [tid + s]; 
  __syncthreads (); 
 } 

 // final steps unrolled: only the first warp (threads 0-31) is active
 if ( tid < 32 )
 { 
  data [tid] += data [tid + 32];
  __syncthreads (); 
  data [tid] += data [tid + 16];
  __syncthreads (); 
  data [tid] += data [tid + 8];
  __syncthreads (); 
  data [tid] += data [tid + 4];
  __syncthreads (); 
  data [tid] += data [tid + 2];
  __syncthreads (); 
  data [tid] += data [tid + 1];
  __syncthreads (); 
 }
 if ( tid == 0 )
  outData [blockIdx.x] = data [0];
}

int main()
{
...
 reduce<<<dim3(128), dim3(128)>>>(dev_data, dev_res);
...
}

P.S. I am using a GeForce GTX 560 Ti.

Asked Dec 21 '12 by Виталий Хабаров

1 Answer

You should declare the shared memory variable as volatile:

__shared__ volatile int data [BLOCK_SIZE]; 

The problem you are seeing is an artifact of the Fermi architecture and compiler optimisation. The Fermi architecture lacks instructions that operate directly on shared memory (they were present in the G80/90/GT200 series), so every shared memory operand is loaded into a register, manipulated there, and stored back to shared memory. The compiler is also free to deduce that the code would be faster if a series of operations were staged in registers, without intermediate loads and stores from/to shared memory. This is perfectly fine except when, as in this sort of reduction code, you are relying on the implicit synchronisation of threads within the same warp manipulating shared memory.
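To make the failure mode concrete, here is a hypothetical sketch (not actual compiler output) of what the unrolled block can effectively become once the optimiser stages the partial sums in a register and the __syncthreads() calls are removed:

 if ( tid < 32 )
 {
  int sum = data [tid];    // single load into a register
  sum += data [tid + 32];  // these reads can return stale values:
  sum += data [tid + 16];  // the other threads in the warp hold
  sum += data [tid + 8];   // their running sums in registers too,
  sum += data [tid + 4];   // so their updates never reach shared
  sum += data [tid + 2];   // memory between the steps
  sum += data [tid + 1];
  data [tid] = sum;        // only one store reaches shared memory
 }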

By declaring the shared memory buffer as volatile, you force the compiler to write the result back to shared memory after each stage of the reduction, and the implicit data synchronisation between threads within the warp is restored.
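Applied to the question's kernel, the change is a single word. As a sketch (assuming the rest of the kernel stays as posted), the intra-warp __syncthreads() calls can then be dropped, which is what the original poster wanted:

#define BLOCK_SIZE 128

__global__ void reduce ( int * inData, int * outData )
{
 __shared__ volatile int data [BLOCK_SIZE]; // volatile is the fix
 int tid = threadIdx.x;
 int i   = blockIdx.x * blockDim.x + threadIdx.x;

 data [tid] = inData [i] + inData [i + blockDim.x / 2 ];
 __syncthreads ();

 for ( int s = blockDim.x / 4; s > 32; s >>= 1 )
 {
  if ( tid < s )
   data [tid] += data [tid + s];
  __syncthreads ();
 }

 // every read and write below now goes through shared memory, so the
 // implicit warp-level lockstep on Fermi is sufficient
 if ( tid < 32 )
 {
  data [tid] += data [tid + 32];
  data [tid] += data [tid + 16];
  data [tid] += data [tid + 8];
  data [tid] += data [tid + 4];
  data [tid] += data [tid + 2];
  data [tid] += data [tid + 1];
 }

 if ( tid == 0 )
  outData [blockIdx.x] = data [0];
}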

This issue is discussed in the programming notes for Fermi which ships (or perhaps shipped) with the CUDA toolkit.

Answered Sep 19 '22 by talonmies