I understand that Fermi GPUs support prefetching to the L1 or L2 cache. However, in the CUDA reference manual I cannot find anything about it.
Does CUDA allow my kernel code to prefetch specific data to a specific level of cache?
With Unified Memory, prefetching proactively triggers asynchronous data migration to the GPU before the data is accessed, which reduces page faults and, consequently, the overhead of handling them.
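For illustration, here is a minimal sketch of that kind of prefetch using the cudaMemPrefetchAsync runtime call on a Unified Memory allocation (the kernel and sizes are invented for the example; on devices without full Unified Memory support the call may have little or no effect):

    // Prefetch managed memory to the GPU ahead of a kernel launch, so the
    // kernel does not take page faults on first touch.
    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void scale(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    int main() {
        const int n = 1 << 20;
        float *data;
        cudaMallocManaged(&data, n * sizeof(float));   // Unified Memory
        for (int i = 0; i < n; ++i) data[i] = 1.0f;    // touched on the CPU first

        int device;
        cudaGetDevice(&device);
        // Migrate the pages to the GPU before the kernel needs them.
        cudaMemPrefetchAsync(data, n * sizeof(float), device, 0);

        scale<<<(n + 255) / 256, 256>>>(data, n);
        cudaDeviceSynchronize();
        printf("data[0] = %f\n", data[0]);
        cudaFree(data);
        return 0;
    }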
There are three key language extensions CUDA programmers can use: CUDA blocks, shared memory, and synchronization barriers. CUDA blocks contain a collection of threads. A block of threads can share memory, and threads can pause until all threads in the block reach a specified point of execution (see the sketch after the kernel-launch discussion below).
__global__ : a qualifier added to standard C. It tells the compiler that a function should be compiled to run on the device (GPU) instead of the host (CPU).
To run a kernel on the CUDA threads, we need two things. First, in the main() function of the program, we call the function to be executed by each thread on the GPU. This invocation is called a kernel launch. Second, with the launch we need to provide the number of threads and their grouping into blocks.
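Below is an illustrative sketch (the kernel name and sizes are invented) that ties these pieces together: a __global__ kernel that uses __shared__ memory and a __syncthreads() barrier, launched from main() with the <<<blocks, threads>>> syntax:

    // A __global__ kernel that reverses an array within one block, using
    // shared memory and a barrier so every write lands before any read.
    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void reverseBlock(int *data) {
        __shared__ int tile[256];          // memory shared by the block's threads
        int t = threadIdx.x;
        tile[t] = data[t];
        __syncthreads();                   // wait until every thread has written
        data[t] = tile[255 - t];
    }

    int main() {
        int h[256], *d;
        for (int i = 0; i < 256; ++i) h[i] = i;
        cudaMalloc(&d, sizeof(h));
        cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);
        reverseBlock<<<1, 256>>>(d);       // kernel launch: 1 block of 256 threads
        cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
        printf("h[0] = %d\n", h[0]);       // prints 255
        cudaFree(d);
        return 0;
    }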
Well, not at the instruction level, but there is detailed information about prefetching in GPUs here:
Many-Thread Aware Prefetching Mechanisms for GPGPU Applications
(paper from the IEEE/ACM International Symposium on Microarchitecture, MICRO 2010)
You can find the instruction reference in NVIDIA's PTX ISA reference document; the relevant instructions are prefetch and prefetchu.
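CUDA C itself does not expose a dedicated intrinsic for these instructions; one common approach is to emit them through inline PTX. A minimal sketch, assuming a 64-bit build (the helper names are mine, and prefetches are only hints that the hardware may ignore):

    // Emit PTX prefetch instructions from CUDA C via inline assembly.
    // prefetch.global.L2 hints that a global-memory line be brought into L2;
    // prefetchu.L1 prefetches a generic address into L1.
    __device__ void prefetch_l2(const void *p) {
        asm volatile("prefetch.global.L2 [%0];" :: "l"(p));
    }

    __device__ void prefetch_l1(const void *p) {
        asm volatile("prefetchu.L1 [%0];" :: "l"(p));
    }

    // Example use: hint the next tile into L2 while processing the current one.
    __global__ void scaleWithHint(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i + 1024 < n) prefetch_l2(&in[i + 1024]);
        if (i < n) out[i] = 2.0f * in[i];
    }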