I am doing research on GPU programming and have a question about modifying a global array from within kernel threads.
__device__ float data[10] = {0,0,0,0,0,0,0,0,0,1};

__global__ void gradually_set_global_data() {
    while (1) {
        if (data[threadIdx.x + 1]) {
            atomicAdd(&data[threadIdx.x], data[threadIdx.x + 1]);
            break;
        }
    }
}

int main() {
    gradually_set_global_data<<<1, 9>>>();
    cudaDeviceReset();
    return 0;
}
The kernel should complete execution with data expected to hold [1,1,1,1,1,1,1,1,1,1], but instead it gets stuck in an endless loop. Why does this happen?
TL;DR: The code is broken by inspection. The CUDA threading model does not guarantee forward progress of any particular thread except in compliance with the following:

1. The machine will schedule, and make forward progress on, some thread (or threads) that has not yet retired and is not blocked at an explicit execution barrier.
Which thread or threads will be chosen for item 1 is undefined by the CUDA programming model. Unless the programmer takes explicit control using execution barriers, the CUDA threading model is free to schedule a single thread for as long as it likes, until that thread retires or hits an explicit execution barrier.
Since the supplied code has no execution barriers, the CUDA work scheduler (with respect to CUDA semantics) is free to schedule, for example, thread 0, and no other threads. If we apply that concept to the supplied code, it's evident that thread 0, if run by itself, will exhibit an infinite loop.
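To see the point concretely, consider what thread 0 executes if it is the only thread ever scheduled (this is an illustration of the argument above, not compiler output). The non-zero value must propagate from data[9] down to data[1] before thread 0's condition can become true, and none of that propagation can happen if no other thread runs:

// What thread 0 effectively executes when scheduled by itself:
// data[1] starts at 0 and can only become non-zero via thread 1's
// atomicAdd, which in turn requires thread 2 to have run, and so on.
while (1) {
    if (data[1]) {                    // always false: thread 1 never runs
        atomicAdd(&data[0], data[1]);
        break;                        // never reached
    }
}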
Longer:
The hang just happens to be the observed behavior, although if it were me, I would not connect the two (the code being broken by inspection, and the observed hang). The reason for the hang (according to the way I would try to describe it) is not "for correctness, this code depends on guarantees that are not provided by the CUDA programming model", although I believe that is a true statement. To understand the reason for the hang, I suggest it's necessary to inspect low-level machine behavior with the SASS (machine assembly code) in view. I don't really have the capability to pursue this topic to exhaustion, so I will present a limited view of this.
Why draw this distinction? Because relatively small changes to the supplied code, which don't actually address the correctness issue, may result in the compiler generating code that doesn't hang. Without careful treatment, one might conclude that because the code does not hang, it must be OK. The point is that whether or not the code hangs is distinct from whether or not it is correct. I've proved this to myself, but I don't wish to provide that code here. The right thing to do is design correct code; see below for my attempt at that.
Before we dive into the SASS, I would like to point out another flaw in the code. The CUDA compiler is free to "optimize" any global data into registers, while maintaining single-thread semantic correctness. The compiler mostly has a single thread in view, and this can trip up programmers who are depending on inter-thread communication (as this code does). For correctness, in this code, data modified by thread x must be visible (eventually) to thread x-1. This sort of inter-thread visibility is not guaranteed by the CUDA programming model, nor does the compiler generally enforce it. For correctness, it's necessary to inform the compiler to make this data visible, and to order loads and stores to make that happen. There are multiple methods to accomplish this. I will suggest marking the data with volatile for simplicity, although it might be possible to do this with execution barriers (e.g. __syncthreads(), __syncwarp()) that also have memory barriers built in. Regardless of the method chosen to enforce inter-thread data visibility, without it, the code is broken, independent of any other considerations.
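To illustrate the hazard (a hypothetical transformation for illustration, not actual compiler output): without volatile, the compiler is permitted, while preserving single-thread semantics, to hoist the load out of the loop, after which the loop can never observe updates made by other threads:

// Hypothetical, but legal, compiler transformation of the loop:
float tmp = data[threadIdx.x + 1];           // load once into a register
while (1) {
    if (tmp) {                               // register never changes, so stores
        atomicAdd(&data[threadIdx.x], tmp);  // made by other threads are invisible
        break;
    }
}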
So prior to diving into SASS, I'll suggest the following modification to the supplied code, along with the SASS following it:
$ cat t1691.cu
__device__ volatile float data[10] = {0,0,0,0,0,0,0,0,0,1};

__global__ void gradually_set_global_data() {
    while (1) {
        if (data[threadIdx.x + 1]) {
            atomicAdd((float *)&data[threadIdx.x], data[threadIdx.x + 1]);
            break;
        }
    }
}

int main() {
    gradually_set_global_data<<<1, 9>>>();
    cudaDeviceReset();
    return 0;
}
$ nvcc -o t1691 t1691.cu
$ cuobjdump -sass ./t1691
Fatbin elf code:
================
arch = sm_30
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit
code for sm_30
Fatbin elf code:
================
arch = sm_30
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit
code for sm_30
Function : _Z25gradually_set_global_datav
.headerflags @"EF_CUDA_SM30 EF_CUDA_PTX_SM(EF_CUDA_SM30)"
/* 0x22f2c04272004307 */
/*0008*/ MOV R1, c[0x0][0x44]; /* 0x2800400110005de4 */
/*0010*/ S2R R0, SR_TID.X; /* 0x2c00000084001c04 */
/*0018*/ MOV32I R3, 0x0; /* 0x180000000000dde2 */
/*0020*/ SSY 0x68; /* 0x6000000100001c07 */
/*0028*/ IMAD R2.CC, R0, 0x4, R3; /* 0x2007c00010009ca3 */
/*0030*/ MOV32I R3, 0x0; /* 0x180000000000dde2 */
/*0038*/ IMAD.U32.U32.HI.X R3, R0, 0x4, R3; /* 0x2086c0001000dc43 */
/* 0x22f043f2f2e2c3f7 */
/*0048*/ LD.E.CV R0, [R2+0x4]; /* 0x8400000010201f85 */
/*0050*/ FSETP.NEU.AND P0, PT, R0, RZ, PT; /* 0x268e0000fc01dc00 */
/*0058*/ @!P0 BRA 0x40; /* 0x4003ffff800021e7 */
/*0060*/ NOP.S; /* 0x4000000000001df4 */
/*0068*/ LD.E.CV R4, [R2+0x4]; /* 0x8400000010211f85 */
/*0070*/ RED.E.ADD.F32.FTZ.RN [R2], R4; /* 0x2c00000000211e05 */
/*0078*/ EXIT; /* 0x8000000000001de7 */
/*0080*/ BRA 0x80; /* 0x4003ffffe0001de7 */
/*0088*/ NOP; /* 0x4000000000001de4 */
/*0090*/ NOP; /* 0x4000000000001de4 */
/*0098*/ NOP; /* 0x4000000000001de4 */
/*00a0*/ NOP; /* 0x4000000000001de4 */
/*00a8*/ NOP; /* 0x4000000000001de4 */
/*00b0*/ NOP; /* 0x4000000000001de4 */
/*00b8*/ NOP; /* 0x4000000000001de4 */
.........................................
Fatbin ptx code:
================
arch = sm_30
code version = [6,4]
producer = <unknown>
host = linux
compile_size = 64bit
compressed
$
According to my testing on cc3.5 and cc7.0 devices, the above code still hangs, so we have not modified its observed behavior with these changes. (Note that the above SASS code is for cc3.0, compiled with CUDA 10.1.243.)
The code will exhibit warp-divergent behavior, and IMO this is essential to understanding the hang, so we will focus on the conditional area of the SASS code:
/*0038*/ IMAD.U32.U32.HI.X R3, R0, 0x4, R3; /* 0x2086c0001000dc43 */
/* 0x22f043f2f2e2c3f7 */
/*0048*/ LD.E.CV R0, [R2+0x4]; /* 0x8400000010201f85 */
/*0050*/ FSETP.NEU.AND P0, PT, R0, RZ, PT; /* 0x268e0000fc01dc00 */
/*0058*/ @!P0 BRA 0x40; /* 0x4003ffff800021e7 */
/*0060*/ NOP.S; /* 0x4000000000001df4 */
/*0068*/ LD.E.CV R4, [R2+0x4]; /* 0x8400000010211f85 */
/*0070*/ RED.E.ADD.F32.FTZ.RN [R2], R4; /* 0x2c00000000211e05 */
/*0078*/ EXIT; /* 0x8000000000001de7 */
By line 0038, all the setup work has been completed. At line 0048, the thread is loading its __device__ data value from global memory (the .CV on the LD instruction is the result of our volatile decoration), the conditional test is performed in line 0050, and the conditional branch in line 0058. If the thread has picked up a non-zero value, it will proceed to line 0060 (and, eventually, perform the atomic operation and exit). If not, it will go back to line 0040 to repeat the load and test.
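For readers who prefer source-level notation, the per-thread loop that this SASS implements is roughly the following (my own annotation of the instructions above, not decompiled output):

// Approximate C equivalent of the SASS loop for one thread:
float v;
do {
    v = data[threadIdx.x + 1];        // LD.E.CV at 0x48 (volatile load)
} while (v == 0.0f);                  // FSETP at 0x50, @!P0 BRA 0x40 at 0x58
atomicAdd((float *)&data[threadIdx.x],
          data[threadIdx.x + 1]);     // LD.E.CV at 0x68, RED.E.ADD at 0x70
// EXIT at 0x78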
Now, what we observe is a hang. The threads that pass and the threads that fail the conditional test are not scheduled at the same time by the warp scheduler. It must schedule one group (e.g. passing) or the other (e.g. failing). This same decision must be made by the warp scheduler repeatedly. If we observe a hang, the only possible conclusion is that the threads that fail the conditional test are repeatedly scheduled (chosen for issue) and the threads that pass the conditional test are not getting scheduled.
This is legal, and any conclusion that the passing threads should get scheduled "eventually" is invalid, according to the CUDA programming model and this code design. The only way to guarantee that the passing threads get scheduled is to hand the warp scheduler a situation such that there are no other choices available to it, in keeping with principle 1 at the top of this answer.
(Aside: Note that we might also have observed that the warp scheduler chose the passing threads instead of the failing threads to schedule/issue. In that case, because those passing threads eventually exit/retire in this realization, I expect that would have resulted in a code that does not hang. The passing threads would eventually all retire, and the warp scheduler would be forced, by item 1 at the top of this answer, to begin to schedule the failing threads. Not hanging here would be an equally valid and possible observation, to the extent that warp scheduling characteristics are outlined here. But basing any correctness conclusions on that outcome would still be wrong.)
Extending this idea, then, one might ask "is there a legal way to realize this pattern?" I will suggest that we now know that we are likely going to need execution barriers if we are going to make this work. Let's choose __syncwarp(). For that barrier, legal usage of the barrier will generally require that we have a fully intact warp (or warps). So we will need to recast the code to allow a full warp to be active, but only the desired threads (a total of 9) doing the "work".
What follows is one possible way to achieve that. I'm sure there are other ways. This code, according to my testing, does not hang on cc3.5 or cc7.0 devices:
__device__ volatile float data[10] = {0,0,0,0,0,0,0,0,0,1};

__global__ void gradually_set_global_data(int sz) {
    int tflag = (threadIdx.x < sz) ? 1 : 0; // choose the needed threads to do the "work"
    unsigned wflag = 1;                     // initially, the entire warp is marked active
    while (wflag) {                         // run the entire warp, or exit the entire warp
        if (tflag)                          // if this thread still needs to do its "work"
            if (data[threadIdx.x + 1]) {
                atomicAdd((float *)&data[threadIdx.x], data[threadIdx.x + 1]);
                tflag = 0;                  // the work for this thread is completed
            }
        __syncwarp();
        wflag = __ballot_sync(0xFFFFFFFFU, tflag); // deactivate warp when all threads done
    }
}

int main() {
    gradually_set_global_data<<<1, 32>>>(9);
    cudaDeviceReset();
    return 0;
}
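As a usage note, one way to confirm the expected result is to copy the array back to the host after the kernel and print it. A minimal sketch (error checking omitted; h_data is a name I've chosen; requires #include <cstdio>):

// Host-side check, e.g. placed in main() before cudaDeviceReset():
float h_data[10];
cudaMemcpyFromSymbol(h_data, data, sizeof(h_data));
for (int i = 0; i < 10; i++)
    printf("%f ", h_data[i]);  // expect 1.0 in every slot
printf("\n");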
Note that if we wanted to hew even closer to the supplied code, the kernel above could be recast with a while(1) loop, issuing a break inside the loop if wflag is zero (after the ballot operation), as sketched below. I don't think there is any meaningful difference in that realization.
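For completeness, the loop of that variant might look like this (my sketch; I expect it to behave identically to the while (wflag) form):

// Variant hewing closer to the original code's while(1)/break structure:
while (1) {
    if (tflag)
        if (data[threadIdx.x + 1]) {
            atomicAdd((float *)&data[threadIdx.x], data[threadIdx.x + 1]);
            tflag = 0;
        }
    __syncwarp();
    if (__ballot_sync(0xFFFFFFFFU, tflag) == 0)
        break;  // all threads done; the whole warp exits together
}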
I still don't claim correctness for this code or any other code that I post. Anyone using any code I post does so at their own risk. I merely claim that I have attempted to address the deficiencies that I found in the original posting, and provide some explanation thereof. I am not claiming my code is defect-free, or that it is suitable for any particular purpose. Use it (or not) at your own risk.