I've been struggling for some time with a problem I can't seem to find a solution to. When I debug my CUDA code using Nvidia Nsight under Visual Studio 2008, I get strange results when using shared memory.
My code is:
template<typename T>
__device__
T integrate()
{
    extern __shared__ T s_test[]; // Dynamically allocated shared memory
    T v = 0; // Placeholder result; assumed to be set by the omitted code
    /**** Breakpoint (1) here ****/
    int index = threadIdx.x + threadIdx.y * blockDim.x; // Local index in block. Column major ordering
    if(index < 64 && blockIdx.x==0) { // Only work on a few values. Just testing
        s_test[index] = (T)index;
        /* Some other irrelevant code here */
    }
    return v;
}
When I reach breakpoint 1 and inspect the shared memory in the Visual Studio Watch window, only the first 8 values of the array change and the others remain null. I would expect all of the first 64 to change.
I thought it might have something to do with the warps not all executing simultaneously, so I tried synchronizing them. I added this code inside integrate():
template<typename T>
__device__
T integrate()
{
    /* Old code is still here */
    __syncthreads();
    /**** Breakpoint (2) here ****/
    if(index < 64 && blockIdx.x==0) {
        T tmp = s_test[index]; // Write to tmp variable so I can inspect it inside Nsight Watch window
        v = tmp + index; // Use `tmp` and `index` somehow so that the compiler doesn't optimize it out of existence
    }
    return v;
}
But the problem is still there. Furthermore, the rest of the values in tmp are not 0, contrary to what the VS Watch window indicates.
I should mention that it takes a lot of steps to step over __syncthreads(), so when I reach it I just jump to breakpoint 2. What the heck is going on!?
EDIT: Information about the system/launch configuration
System
Device GeForce 9500 GT
IDE
Compiler commands
1> "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.2\bin\nvcc.exe" -G -gencode=arch=compute_10,code=\"sm_10,compute_10\" --machine 32 -ccbin "C:\Program Files\Microsoft Visual Studio 9.0\VC\bin" -D_NEXUS_DEBUG -g -D_DEBUG -Xcompiler "/EHsc /W3 /nologo /Od /Zi /RTC1 /MDd " -I"inc" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.2\include" -maxrregcount=0 --compile -o "Debug/process_f2f.cu.obj" process_f2f.cu
Launch configuration. The shared memory size doesn't seem to matter. I've tried several versions. The one I've worked with the most is:
Have you tried putting __syncthreads() after assigning the values?
template<typename T>
__device__
T integrate()
{
    extern __shared__ T s_test[]; // Dynamically allocated shared memory
    T v = 0; // Placeholder result; assumed to be set by the omitted code
    int index = threadIdx.x + threadIdx.y * blockDim.x; // Local index in block. Column major ordering
    if(index < 64 && blockIdx.x==0) { // Only work on a few values. Just testing
        s_test[index] = (T)index;
        /* Some other irrelevant code here */
    }
    __syncthreads(); // Make every thread's write visible to the whole block
    /**** Breakpoint (1) here ****/
    return v;
}
And try to see the values at this breakpoint.
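If the Watch window still looks wrong after that, it may help to check the values without the debugger at all. Below is a minimal sketch, not the asker's actual code: it copies the shared values to global memory and compares them on the host. The names testKernel, countMismatches and d_out, the 16x16 block and the 64-float shared size are all assumptions made up for illustration, since the real launch configuration isn't shown. The shared buffer is declared as raw bytes and cast to T, which is a common way to avoid conflicting extern __shared__ declarations when a template is instantiated for more than one type.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for integrate(); d_out is an added debug output.
template <typename T>
__device__ T integrate(T* d_out)
{
    // Untyped dynamic shared memory, cast to T.
    extern __shared__ __align__(16) unsigned char s_raw[];
    T* s_test = reinterpret_cast<T*>(s_raw);

    int index = threadIdx.x + threadIdx.y * blockDim.x; // Local index in block
    T v = 0;
    if (index < 64 && blockIdx.x == 0) {
        s_test[index] = (T)index;
    }
    __syncthreads(); // Reached by every thread in the block
    if (index < 64 && blockIdx.x == 0) {
        v = s_test[index] + index;
        d_out[index] = s_test[index]; // Copy out so the host can inspect it
    }
    return v;
}

// Hypothetical wrapper kernel; the original kernel is not shown.
template <typename T>
__global__ void testKernel(T* d_out)
{
    integrate<T>(d_out);
}

// Counts entries where h[i] differs from (float)i.
int countMismatches(const float* h, int n)
{
    int bad = 0;
    for (int i = 0; i < n; ++i)
        if (h[i] != (float)i) ++bad;
    return bad;
}

int main()
{
    const int n = 64;
    float h_out[n];
    float* d_out = 0;
    cudaMalloc((void**)&d_out, n * sizeof(float));
    // Fill with an all-ones bit pattern (NaN as float) so unwritten slots stand out.
    cudaMemset(d_out, 0xFF, n * sizeof(float));

    dim3 block(16, 16); // Assumed block shape
    // Third launch parameter: dynamic shared memory size in bytes.
    testKernel<float><<<1, block, n * sizeof(float)>>>(d_out);

    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("mismatches: %d of %d\n", countMismatches(h_out, n), n);
    cudaFree(d_out);
    return 0;
}
```

If this reports no mismatches while the Watch window still shows only 8 values changing, that would suggest the kernel itself is fine and the issue lies in how the debugger displays shared memory.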