Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

CUDA inline PTX ld.shared runs into cudaErrorIllegalAddress error

I'm using inline PTX ld.shared to load data from shared memory:

__shared__ float As[BLOCK_SIZE][BLOCK_SIZE];  //declare a buffer in shared memory
float Csub = 0;

As[TY][TX] = A[a + wA * TY + TX];             //load data from global memory to shared memory
__syncthreads();
float t;
asm("ld.shared.f32 %0, [%1];" :"=f"(t) : "r"((int)&As[TY][k]));  //load data from shared memory into t
Csub += t;
__syncthreads();

But it runs into an error:

CUDA error at C:/ProgramData/NVIDIA Corporation/CUDA Samples/v11.2/0_Simple/matrixMul_mine/matrixMul.cu:196 code=700(cudaErrorIllegalAddress) "cudaStreamSynchronize(stream)" 

I dumped the SASS and found that the LDS happens even earlier than LDG and the two bar.sync instructions. It seems that the compiler looses track of the data dependency.

So my questions are:

  1. Is there anything wrong in my inline PTX that leads to cudaErrorIllegalAddress?
  2. Does inline PTX disturb the compilers ability to track data dependencies?
like image 725
Yichen Avatar asked Mar 10 '26 19:03

Yichen


1 Answers

Yichen's comment is right.

There are two types of addressing: ld. or ld.statespace.

If ld. is used on its own, the address should be a generic address. The generic address, to my limited understanding, is the CUDA-C++ pointer value, like &As[TY][k] in your code.

If ld.statespace is used, the address should be the address in the state space.

I think if you use ld.f32 instead of ld.shared.f32, your code should be okay. BTW, I don't think you can use the generic address in 32-bit data width, which can truncate the generic address into a wrong value.

Or you can convert the generic address to the shared space address. Here is CUTLASS' conversion code:

      ".reg .u32 smem_ptr32;\n\t"
      ".reg .u64 smem_ptr64; cvta.to.shared.u64 smem_ptr64, %1; cvt.u32.u64 smem_ptr32, smem_ptr64; \n\t"

Then use smem_ptr32 instead of [%1]:

"ld.shared.f32 %0, [smem_ptr32];"

As the PTX ISA says, this address can be either 32-bit or 64-bit. I think it's not necessary to convert the 64-bit pointer to a 32-bit pointer. Using smem_ptr64 should be okay.

Here is what the shared memory address could look like:

Pointer State Space Value
CUDA-C++ pointer Generic Space 1526743433216 + 1024
smem_ptr64 Shared Space 0 + 1024
like image 129
Mr.Ly Avatar answered Mar 13 '26 16:03

Mr.Ly