CUDA inline PTX ld.shared runs into cudaErrorIllegalAddress error

Question

I'm using inline PTX ld.shared to load data from shared memory:

__shared__ float As[BLOCK_SIZE][BLOCK_SIZE];  //declare a buffer in shared memory
float Csub = 0;

As[TY][TX] = A[a + wA * TY + TX];             //load data from global memory to shared memory
__syncthreads();
float t;
asm("ld.shared.f32 %0, [%1];" :"=f"(t) : "r"((int)&As[TY][k]));  //load data from shared memory into t
Csub += t;
__syncthreads();

But it runs into an error:

CUDA error at C:/ProgramData/NVIDIA Corporation/CUDA Samples/v11.2/0_Simple/matrixMul_mine/matrixMul.cu:196 code=700(cudaErrorIllegalAddress) "cudaStreamSynchronize(stream)"

I dumped the SASS and found that the LDS happens even earlier than LDG and the two bar.sync instructions. It seems that the compiler looses track of the data dependency.

So my questions are:

Is there anything wrong in my inline PTX that leads to cudaErrorIllegalAddress?
Does inline PTX disturb the compilers ability to track data dependencies?

Mr.Ly · Accepted Answer

Yichen's comment is right.

There are two types of addressing: ld. or ld.statespace.

If ld. is used on its own, the address should be a generic address. The generic address, to my limited understanding, is the CUDA-C++ pointer value, like &As[TY][k] in your code.

If ld.statespace is used, the address should be the address in the state space.

I think if you use ld.f32 instead of ld.shared.f32, your code should be okay. BTW, I don't think you can use the generic address in 32-bit data width, which can truncate the generic address into a wrong value.

Or you can convert the generic address to the shared space address. Here is CUTLASS' conversion code:

      ".reg .u32 smem_ptr32;
	"
      ".reg .u64 smem_ptr64; cvta.to.shared.u64 smem_ptr64, %1; cvt.u32.u64 smem_ptr32, smem_ptr64; 
	"

Then use smem_ptr32 instead of [%1]:

"ld.shared.f32 %0, [smem_ptr32];"

As the PTX ISA says, this address can be either 32-bit or 64-bit. I think it's not necessary to convert the 64-bit pointer to a 32-bit pointer. Using smem_ptr64 should be okay.

Here is what the shared memory address could look like:

Pointer	State Space	Value
CUDA-C++ pointer	Generic Space	`1526743433216 + 1024`
`smem_ptr64`	Shared Space	`0 + 1024`

CUDA inline PTX ld.shared runs into cudaErrorIllegalAddress error

Tags:

inline

cuda

ptx

gpu-shared-memory

Yichen

1 Answers

Mr.Ly

Recent Activity

Donate For Us

CUDA inline PTX ld.shared runs into cudaErrorIllegalAddress error

Tags:

inline

cuda

ptx

gpu-shared-memory

Yichen

1 Answers

Mr.Ly

Related questions

Recent Activity

Donate For Us