Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Prefetch in cuda (through C code)

Tags:

cuda

prefetch

I am working on data prefetch in CUDA (Fermi GPU) through C code. Cuda reference manual talks about the prefetching at ptx level code not at C level code.

Can anyone connect me with some documents or something regarding prefetching through cuda code (cu file). Any help would be appreciated.

like image 502
user1805482 Avatar asked Nov 07 '12 08:11

user1805482


People also ask

Does CUDA use GPU memory?

By using CUDA Unified Memory, GPU can use data stored in physically different locations without explicitly calling memory copy functions such as cudaMemcpy() and cudaMemcpyAsync().

What is cudaMallocManaged?

On systems with pre-Pascal GPUs like the Tesla K80, calling cudaMallocManaged() allocates size bytes of managed memory on the GPU device that is active when the call is made1.

Can CUDA use shared memory?

Shared memory is a powerful feature for writing well optimized CUDA code. Access to shared memory is much faster than global memory access because it is located on chip. Because shared memory is shared by threads in a thread block, it provides a mechanism for threads to cooperate.


2 Answers

According to PTX manual here is how prefetch works in PTX:

enter image description here

You can embed the PTX instructions into the CUDA kernel. Here is a tiny sample from NVIDIA's documentation:

__device__ int cube (int x)
{
  int y;
  asm("{\n\t"                       // use braces for local scope
      " .reg .u32 t1;\n\t"           // temp reg t1,
      " mul.lo.u32 t1, %1, %1;\n\t" // t1 = x * x
      " mul.lo.u32 %0, t1, %1;\n\t" // y = t1 * x
      "}"
      : "=r"(y) : "r" (x));
  return y;
}

You may come to conclude with the following prefetch function in C:

__device__ void prefetch_l1 (unsigned int addr)
{

  asm(" prefetch.global.L1 [ %1 ];": "=r"(addr) : "r"(addr));
}

NOTICE: You need the GPU of Compute Capability 2.0 or higher for prefetch. Pass proper compile flags accordingly -arch=sm_20

like image 179
lashgar Avatar answered Oct 01 '22 05:10

lashgar


According to this thread, below is the code for different cache prefetching techniques:

#define DEVICE_STATIC_INTRINSIC_QUALIFIERS  static __device__ __forceinline__

#if (defined(_MSC_VER) && defined(_WIN64)) || defined(__LP64__)
#define PXL_GLOBAL_PTR   "l"
#else
#define PXL_GLOBAL_PTR   "r"
#endif

DEVICE_STATIC_INTRINSIC_QUALIFIERS void __prefetch_global_l1(const void* const ptr)
{
  asm("prefetch.global.L1 [%0];" : : PXL_GLOBAL_PTR(ptr));
}

DEVICE_STATIC_INTRINSIC_QUALIFIERS void __prefetch_global_uniform(const void* const ptr)
{
  asm("prefetchu.L1 [%0];" : : PXL_GLOBAL_PTR(ptr));
}

DEVICE_STATIC_INTRINSIC_QUALIFIERS void __prefetch_global_l2(const void* const ptr)
{
  asm("prefetch.global.L2 [%0];" : : PXL_GLOBAL_PTR(ptr));
}
like image 23
Serge Rogatch Avatar answered Oct 01 '22 05:10

Serge Rogatch