I have something like this:
__global__ void globFunction(int *arr, int N) {
int idx = blockIdx.x* blockDim.x+ threadIdx.x;
// calculating and Writing results to arr ...
__syncthreads();
// reading values of another threads(ex i+1)
int val = arr[idx+1]; // IT IS GIVING OLD VALUE
}
int main() {
// declare array, alloc memory, copy memory, etc.
globFunction<<< 4000, 256>>>(arr, N);
// do something ...
return 0;
}
Why am I getting the old value when I read arr[idx+1]
? I called __syncthreads
, so I expect to see the updated value. What did I do wrong? Am I reading a cache or what?
Using the __syncthreads()
function only synchronizes the threads in the current block. In this case this would be the 256 threads per block you created when you launched the kernel. So in your given array, for each index value that crosses over into another block of threads, you'll end up reading a value from global memory that is not synchronized with respect to the threads in the current block.
One thing you can do to circumvent this issue is create shared thread-local storage using the __shared__
CUDA directive that allows the threads in your blocks to share information among themselves, but prevents threads from other blocks accessing the memory allocated for the current block. Once your calculation within the block is complete (and you can use __syncthreads()
for this task), you can then copy back into the globally accessible memory the values in the shared block-level storage.
Your kernel could look something like:
__global__ void globFunction(int *arr, int N)
{
__shared__ int local_array[THREADS_PER_BLOCK]; //local block memory cache
int idx = blockIdx.x* blockDim.x+ threadIdx.x;
//...calculate results
local_array[threadIdx.x] = results;
//synchronize the local threads writing to the local memory cache
__syncthreads();
// read the results of another thread in the current thread
int val = local_array[(threadIdx.x + 1) % THREADS_PER_BLOCK];
//write back the value to global memory
arr[idx] = val;
}
If you must synchronize threads across blocks, you should be looking for another way to solve your problem, since the CUDA programing model works most effectively when a problem can be broken down into blocks, and threads synchronization only needs to take place within a block.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With