CUDA synchronization and reading global memory

Question

I have something like this:

__global__ void globFunction(int *arr, int N) {
    int idx = blockIdx.x* blockDim.x+ threadIdx.x;
    // calculating and Writing results to arr ...
    __syncthreads();
    // reading values of another threads(ex i+1)
    int val = arr[idx+1]; // IT IS GIVING OLD VALUE
}


int main() {
    // declare array, alloc memory, copy memory, etc.
    globFunction<<< 4000, 256>>>(arr, N); 
    // do something ...
    return 0;
}

Why am I getting the old value when I read arr[idx+1]? I called __syncthreads, so I expect to see the updated value. What did I do wrong? Am I reading a cache or what?

Jason · Accepted Answer

Using the __syncthreads() function only synchronizes the threads in the current block. In this case this would be the 256 threads per block you created when you launched the kernel. So in your given array, for each index value that crosses over into another block of threads, you'll end up reading a value from global memory that is not synchronized with respect to the threads in the current block.

One thing you can do to circumvent this issue is create shared thread-local storage using the __shared__ CUDA directive that allows the threads in your blocks to share information among themselves, but prevents threads from other blocks accessing the memory allocated for the current block. Once your calculation within the block is complete (and you can use __syncthreads() for this task), you can then copy back into the globally accessible memory the values in the shared block-level storage.

Your kernel could look something like:

__global__ void globFunction(int *arr, int N) 
{
    __shared__ int local_array[THREADS_PER_BLOCK];  //local block memory cache           
    int idx = blockIdx.x* blockDim.x+ threadIdx.x;

    //...calculate results
    local_array[threadIdx.x] = results;

    //synchronize the local threads writing to the local memory cache
    __syncthreads();

    // read the results of another thread in the current thread
    int val = local_array[(threadIdx.x + 1) % THREADS_PER_BLOCK];

    //write back the value to global memory
    arr[idx] = val;        
}

If you must synchronize threads across blocks, you should be looking for another way to solve your problem, since the CUDA programing model works most effectively when a problem can be broken down into blocks, and threads synchronization only needs to take place within a block.

CUDA synchronization and reading global memory

Tags:

cuda

nosbor

1 Answers

Jason

Recent Activity

Donate For Us

CUDA synchronization and reading global memory

Tags:

cuda

nosbor

1 Answers

Jason

Related questions

Recent Activity

Donate For Us