CUDA substituting __syncthreads instead of __threadfence() difference

Question

I have copied the below code from NVIDIA manual Eg: for __threadfence(). Why they have used __threadfence() in the below code. I think using __syncthreads() instead of __threadfence() will give you the same result.

Can someone explain the difference between __syncthreads() and __threadfence() calls?

__device__ unsigned int count = 0;
__shared__ bool isLastBlockDone;

__global__ void sum(const float* array, unsigned int N,float* result)
{
    // Each block sums a subset of the input array
    float partialSum = calculatePartialSum(array, N);

    if (threadIdx.x == 0) {
        // Thread 0 of each block stores the partial sum
        // to global memory
        result[blockIdx.x] = partialSum;

        // Thread 0 makes sure its result is visible to
        // all other threads
        __threadfence();

        // Thread 0 of each block signals that it is done
        unsigned int value = atomicInc(&count, gridDim.x);

        // Thread 0 of each block determines if its block is
        // the last block to be done
        isLastBlockDone = (value == (gridDim.x - 1));
    }

    // Synchronize to make sure that each thread reads
    // the correct value of isLastBlockDone
    __syncthreads();

    if (isLastBlockDone) 
    {
        // The last block sums the partial sums
        // stored in result[0 .. gridDim.x-1]
        float totalSum = calculateTotalSum(result);

        if (threadIdx.x == 0)
        {
            // Thread 0 of last block stores total sum
            // to global memory and resets count so that
            // next kernel call works properly
            result[0] = totalSum;
            count = 0;
        }
    }
}

CygnusX1 · Accepted Answer

In terms of shared memory __syncthreads() is simply stronger than __threadfence(). Regarding global memory - those are two different things.

__threadfence_block() stalls current thread until all writes to shared memory are visible to other threads from the same block. It prevents the compiler from optimising by caching shared memory writes in registers. It does not synchronise the threads and it is not necessary for all threads to actually reach this instruction.
__threadfence() stalls current thread until all writes to shared and global memory are visible to all other threads.
__syncthreads() must be reached by all threads from the block (e.g. no divergent if statements) and ensures that the code preceding the instruction is executed before the instructions following it, for all threads in the block.

In your particular case, the __threadfence() instruction is used to make sure that writes to global array result are visible to everyone. __syncthreads() would merely synchronise threads in the current block only, without enforcing the global memory writes for other block. What is more, at that point in the code you are inside an if branch, only one thread is executing that code; using __syncthreads() would result in an undefined behaviour of the GPU, most likely leading to complete desynchronisation of the kernel.

Check out the following chapters in the CUDA C Programming Guide:

3.2.2 "Shared Memory" - the example of matrix multiplication
5.4.3 "Synchronization Instruction"
B.2.5 "volatile"
B.5 "Memory Fence Functions"

CUDA substituting syncthreads instead of threadfence() difference

Tags:

cuda

kar

Video Answer

1 Answers

CygnusX1

Recent Activity

Donate For Us