I have several blocks, each having some integers in a shared memory array of size 512. How can I check if the array in every block contains a zero as an element?
What I am doing is creating an array that resides in the global memory. The size of this array depends on the number of blocks, and it is initialized to 0. Hence every block writes to a[blockid] = 1
if the shared memory array contains a zero.
My problem is when I have several threads in a single block writing at the same time. That is, if the array in the shared memory contains more than one zero, then several threads will write a[blockid] = 1
. Would this generate any problem?
In other words, would it be a problem if 2 threads write the exact same value to the exact same array element in global memory?
For a CUDA program, if multiple threads in a warp write to the same location then the location will be updated but it is undefined how many times the location is updated (i.e. how many actual writes occur in series) and it is undefined which thread will write last (i.e. which thread will win the race).
For devices of compute capability 2.x, if multiple threads in a warp write to the same address then only one thread will actually perform the write, which thread is undefined.
From the CUDA C Programming Guide section F.4.2:
If a non-atomic instruction executed by a warp writes to the same location in global memory for more than one of the threads of the warp, only one thread performs a write and which thread does it is undefined.
See also section 4.1 of the guide for more info.
In other words, if all threads writing to a given location write the same value, then it is safe.
In the CUDA execution model, there are no guarantees that every simultaneous write from threads in the same block to the same global memory location will succeed. At least one write will work, but it isn't guaranteed by the programming model how many write transactions will occur, or in what order they will occur if more than one transaction is executed.
If this is a problem, then a better approach (from a correctness point of view), would be to have only one thread from each block do the global write. You can either use a shared memory flag set atomically or a reduction operation to determine whether the value should be set. Which you choose might depend on how many zeros there are likely to be. The more zeroes there are, the more attractive the reduction will be. CUDA includes warp level __any()
and __all()
operators which can be built into a very efficient boolean reduction in a few lines of code.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With