Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

CUDA: __restrict__ tag usage

I don't quite understand the concept of the __restrict__ tag in CUDA.

I've read that using __restrict__ avoids pointers aliasing and in particular, if the variable pointed at is read-only, the reading of the variable is optimized because it's cached.

This is a simplified version of the code:

__constant__ float M[M_DIM1][M_DIM2];

__host__ void function(float N[][DIM2], float h_M[][M_DIM2], float P[][DIM2]);

__global__ void kernel_function(const float* __restrict__ N, float *P);

__host__ void function(float N[][DIM2], float h_M[][M_DIM2], float P[][DIM2]) {

    int IOSize = DIM1 * DIM2 * sizeof(float);
    int ConstSize = M_DIM1* M_DIM2* sizeof(float);
    float* dN, *dP;
    cudaMalloc((void**)&dN, IOSize);
    cudaMemcpy(dN, N, IOSize, cudaMemcpyHostToDevice);

    cudaMemcpyToSymbol(M, h_M, ConstSize);

    cudaMalloc((void**)&dP, IOSize);

    dim3 dimBlock(DIM1, DIM2);
    dim3 dimGrid(1, 1);

    kernel_function << <dimGrid, dimBlock >> >(dN, dP);

    cudaMemcpy(P, dP, IOSize, cudaMemcpyDeviceToHost);

    cudaFree(dN);
    cudaFree(dP);

}

Am I using the __restrict__ tag on N, that is read-only, in the right way? Furthermore, I've read that the keyword __constant__ on M means that is read-only and constant, so what is the difference between the two of them, the type of allocation?

like image 772
Pleasant94 Avatar asked Mar 07 '26 09:03

Pleasant94


1 Answers

__restrict__ as used by nvcc is documented here. (note that various c++ compilers including gnu compilers also have support for this exact keyword, and use it similarly).

It has essentially the same semantics as the C99 restrict keyword, which is an official part of that language standard.

In a nutshell, __restrict__ is a contract that you as a programmer make with the compiler, which says, roughly, "I will only use this pointer to refer to the underlying data". One of the key things that this takes off the table from the compiler's perspective is pointer aliasing, which can prevent the compiler from being able to make various optimizations.

If you'd like a longer formal treatise on the exact definition of restrict or __restrict__, please refer to one of the links I've already given, or do some research.

So, __restrict__ is generally useful to compilers that support it, for optimization purposes.

For compute capability 3.5 or higher devices, these devices have a separate cache called the read only cache which is independent of normal L1 type caching.

If you use both __restrict__ and const to decorate global pointers passed to a kernel, then this is also a strong hint to the compiler, when generating code for cc3.5 and higher devices, to cause those global memory loads to flow through the read-only cache. This can provide application performance benefits, often with little other code refactoring. This doesn't guarantee usage of the read-only cache, and the compiler will often attempt to aggressively use the read only cache if it can satisfy the necessary conditions, even if you don't use these decorators.

__constant__ refers to a different hardware resource on the GPU. There are many differences:

  • __constant__ is available on all GPUs, the read-only cache only on cc3.5 and higher
  • memory allocated using the __constant__ tag (which is included on the line to designate the allocation of memory) is limited to a maximum of 64KB. The read-only cache has no such limit. We don't put __restrict__ on a line that allocates memory; it is used to decorate a pointer.
  • data cached in the read-only cache has the typical global memory access considerations - normally we want adjacent and contiguous access for best coalescing of global memory reads through the read-only cache. The __constant__ mechanism, OTOH, expects so-called uniform access for fastest performance. Uniform access essentially means that every thread in a warp is requesting data from the same location/address/index.

Both __constant__ memory, and global memory marked with const decorator on the pointer passed to kernel code, are read-only from the perspective of kernel code.

I don't see any obvious problems in the code you have shown, whether with use of __restrict__ or anything else. The only comment I would have is that for maximal benefit you may want to decorate both the N and P pointers in your kernel declaration/prototype with __restrict__, for maximal benefit, if that is your intent. (You would not decorate P with const, obviously.)

like image 178
Robert Crovella Avatar answered Mar 10 '26 00:03

Robert Crovella