I have copied the code below from the NVIDIA CUDA manual's example for __threadfence(). Why have they used __threadfence() in this code? I think using __syncthreads() instead of __threadfence() would give the same result. Can someone explain the difference between the __syncthreads() and __threadfence() calls?
__device__ unsigned int count = 0;
__shared__ bool isLastBlockDone;

__global__ void sum(const float* array, unsigned int N, float* result)
{
    // Each block sums a subset of the input array
    float partialSum = calculatePartialSum(array, N);

    if (threadIdx.x == 0) {
        // Thread 0 of each block stores the partial sum
        // to global memory
        result[blockIdx.x] = partialSum;

        // Thread 0 makes sure its result is visible to
        // all other threads
        __threadfence();

        // Thread 0 of each block signals that it is done
        unsigned int value = atomicInc(&count, gridDim.x);

        // Thread 0 of each block determines if its block is
        // the last block to be done
        isLastBlockDone = (value == (gridDim.x - 1));
    }

    // Synchronize to make sure that each thread reads
    // the correct value of isLastBlockDone
    __syncthreads();

    if (isLastBlockDone) {
        // The last block sums the partial sums
        // stored in result[0 .. gridDim.x-1]
        float totalSum = calculateTotalSum(result);

        if (threadIdx.x == 0) {
            // Thread 0 of last block stores total sum
            // to global memory and resets count so that
            // next kernel call works properly
            result[0] = totalSum;
            count = 0;
        }
    }
}
In terms of shared memory, __syncthreads() is simply stronger than __threadfence(). Regarding global memory, those are two different things.
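For the shared-memory case, here is a minimal sketch (my own example, not from the guide, assuming a block size of 256) of a block-wide reduction: __syncthreads() is needed both as a barrier and for visibility of the shared-memory writes, whereas a fence alone would not wait for the other threads to have produced their values.

// Minimal sketch: block-wide sum in shared memory, assuming blockDim.x == 256.
__global__ void blockSum(const float* in, float* out)
{
    __shared__ float tile[256];

    unsigned int tid = threadIdx.x;
    tile[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();   // barrier + visibility of all shared-memory writes

    // Tree reduction: each step must wait for the writes made by other
    // threads in the previous step before reading them.
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            tile[tid] += tile[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = tile[0];   // per-block partial sum
}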
__threadfence_block() stalls the current thread until all of its writes to shared and global memory are visible to the other threads of the same block. It prevents the compiler from optimising by caching shared-memory writes in registers. It does not synchronise the threads, and it is not necessary for all threads to actually reach this instruction.

__threadfence() stalls the current thread until all of its writes to shared and global memory are visible to all other threads.

__syncthreads() must be reached by all threads of the block (e.g. no divergent if statements) and ensures that the code preceding the instruction is executed before the instructions following it, for all threads in the block.

In your particular case, the __threadfence() instruction is used to make sure that the writes to the global array result are visible to everyone. __syncthreads() would merely synchronise the threads of the current block, without enforcing visibility of the global-memory writes to the other blocks. What is more, at that point in the code you are inside an if branch that only one thread executes; calling __syncthreads() there would result in undefined behaviour of the GPU, most likely leading to a complete desynchronisation of the kernel.
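To make the inter-block part concrete, here is a minimal, self-contained sketch of the same publish-then-signal pattern (my own example; the names publish, values, total and done are made up): each block publishes one value to global memory, executes __threadfence(), and only then announces completion, so the block that observes the last increment is guaranteed to see every published value.

__device__ unsigned int done = 0;

__global__ void publish(int* values, int* total)
{
    if (threadIdx.x == 0) {
        // Publish this block's value to global memory.
        values[blockIdx.x] = (int)blockIdx.x + 1;

        // Make the write visible device-wide before signalling completion.
        __threadfence();

        unsigned int seen = atomicInc(&done, gridDim.x);
        if (seen == gridDim.x - 1) {
            // This block saw the last increment: every other block has already
            // executed its __threadfence(), so all entries of values[] are
            // guaranteed to be visible here.
            int sum = 0;
            for (unsigned int i = 0; i < gridDim.x; ++i)
                sum += values[i];
            *total = sum;   // equals gridDim.x * (gridDim.x + 1) / 2
            done = 0;       // reset so the next launch works as well
        }
    }
}

Replacing that __threadfence() with __syncthreads() would be illegal here for exactly the reason above: only thread 0 of each block reaches it.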
Check out the Memory Fence Functions and Synchronization Functions sections in the CUDA C Programming Guide.