I am working on data prefetch in CUDA (Fermi GPU) through C code. Cuda reference manual talks about the prefetching at ptx level code not at C level code.
Can anyone connect me with some documents or something regarding prefetching through cuda code (cu file). Any help would be appreciated.
By using CUDA Unified Memory, GPU can use data stored in physically different locations without explicitly calling memory copy functions such as cudaMemcpy() and cudaMemcpyAsync().
On systems with pre-Pascal GPUs like the Tesla K80, calling cudaMallocManaged() allocates size bytes of managed memory on the GPU device that is active when the call is made1.
Shared memory is a powerful feature for writing well optimized CUDA code. Access to shared memory is much faster than global memory access because it is located on chip. Because shared memory is shared by threads in a thread block, it provides a mechanism for threads to cooperate.
According to PTX manual here is how prefetch works in PTX:
You can embed the PTX instructions into the CUDA kernel. Here is a tiny sample from NVIDIA's documentation:
__device__ int cube (int x)
{
int y;
asm("{\n\t" // use braces for local scope
" .reg .u32 t1;\n\t" // temp reg t1,
" mul.lo.u32 t1, %1, %1;\n\t" // t1 = x * x
" mul.lo.u32 %0, t1, %1;\n\t" // y = t1 * x
"}"
: "=r"(y) : "r" (x));
return y;
}
You may come to conclude with the following prefetch function in C:
__device__ void prefetch_l1 (unsigned int addr)
{
asm(" prefetch.global.L1 [ %1 ];": "=r"(addr) : "r"(addr));
}
NOTICE: You need the GPU of Compute Capability 2.0 or higher for prefetch. Pass proper compile flags accordingly -arch=sm_20
According to this thread, below is the code for different cache prefetching techniques:
#define DEVICE_STATIC_INTRINSIC_QUALIFIERS static __device__ __forceinline__
#if (defined(_MSC_VER) && defined(_WIN64)) || defined(__LP64__)
#define PXL_GLOBAL_PTR "l"
#else
#define PXL_GLOBAL_PTR "r"
#endif
DEVICE_STATIC_INTRINSIC_QUALIFIERS void __prefetch_global_l1(const void* const ptr)
{
asm("prefetch.global.L1 [%0];" : : PXL_GLOBAL_PTR(ptr));
}
DEVICE_STATIC_INTRINSIC_QUALIFIERS void __prefetch_global_uniform(const void* const ptr)
{
asm("prefetchu.L1 [%0];" : : PXL_GLOBAL_PTR(ptr));
}
DEVICE_STATIC_INTRINSIC_QUALIFIERS void __prefetch_global_l2(const void* const ptr)
{
asm("prefetch.global.L2 [%0];" : : PXL_GLOBAL_PTR(ptr));
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With