 

The peak throughput of a CUDA kernel on an NVIDIA GPU

I have a question about the throughput of a kernel running on a GPU. Assume its occupancy is 0.5 and its block size is 256. The programming guide states that it is better to have many blocks so they can hide memory latency, etc., but I don't understand why this is correct. Because as soon as the kernel has 24 warps per streaming multiprocessor, i.e., 3 blocks (24 warps × 32 threads/warp = 768 threads = 3 blocks of 256 threads), it will reach peak throughput. So having more than 24 warps (or 3 blocks) won't change the throughput at all.

Am I missing anything? Can anyone correct me?

asked Aug 06 '11 by Zk1001


1 Answer

While it is true that low-occupancy SMs cannot sufficiently hide latency, it is important to understand this:

Higher Occupancy != Higher Throughput!

Occupancy is simply a measure of how much work is available for the SM to choose from at any given instant. Having more resident warps gives the SM more ability to do useful work while other warps are waiting for results (results of memory accesses, or computations -- both have non-zero latency).

Throughput is a measure of how much work gets done per second, and while it can be limited by latency (and therefore occupancy), it also can be limited by memory bandwidth, instruction throughput (the number of execution units), and other factors.
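For concreteness, occupancy can be measured rather than guessed. Here is a minimal sketch, assuming a hypothetical kernel myKernel, that uses the CUDA runtime helper cudaOccupancyMaxActiveBlocksPerMultiprocessor (added in CUDA 6.5, so newer than the Fermi-era toolkits this question dates from) to compute the ratio of resident warps to the SM's maximum:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Hypothetical kernel, used here only to query occupancy for illustration.
    __global__ void myKernel(float *out) {
        out[blockIdx.x * blockDim.x + threadIdx.x] = 1.0f;
    }

    int main() {
        int blockSize = 256;

        // How many 256-thread blocks of myKernel fit on one SM, given the
        // kernel's register and shared-memory usage?
        int numBlocksPerSM = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocksPerSM,
                                                      myKernel, blockSize, 0);

        int maxThreadsPerSM = 0;
        cudaDeviceGetAttribute(&maxThreadsPerSM,
                               cudaDevAttrMaxThreadsPerMultiProcessor, 0);

        // Occupancy = resident warps / maximum resident warps per SM.
        float occupancy = (float)(numBlocksPerSM * blockSize / 32)
                        / (float)(maxThreadsPerSM / 32);
        printf("resident blocks/SM: %d, occupancy: %.2f\n",
               numBlocksPerSM, occupancy);
        return 0;
    }

Whatever this reports only tells you how much work the SM can choose from at once, not how fast that work will complete.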

The reason the programming guide states that it is better to have multiple thread blocks than just one large thread block is that sometimes the SM needs to be able to issue work not just from other warps, but from other blocks. Here's an example:

Imagine that your big thread block has to load data from global memory (high latency) and store it into shared memory (low latency), and then must immediately do a __syncthreads(). In this case, when a warp is finished loading its data and writing it to shared memory, it must then wait until all other threads in the block finish doing the same. For a large block, that can be quite a while. But if there are multiple smaller thread blocks occupying the SM, then the SM can switch and do work from the other blocks while waiting for the __syncthreads() to be satisfied in the first block. This can help reduce GPU idle time and improve efficiency.
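Here is a minimal sketch of that pattern (the kernel name, tile size, and follow-up computation are illustrative, not from the question): each block stages a tile from global memory into shared memory and then waits at a __syncthreads() barrier, which is exactly where having other resident blocks to schedule pays off:

    #include <cuda_runtime.h>

    #define TILE 256  // illustrative block size; launch with TILE threads per block

    __global__ void stageAndReduce(const float *in, float *out, int n) {
        __shared__ float tile[TILE];

        int gid = blockIdx.x * blockDim.x + threadIdx.x;

        // High-latency global load, staged into low-latency shared memory.
        tile[threadIdx.x] = (gid < n) ? in[gid] : 0.0f;

        // Every warp in this block must reach the barrier before any may
        // proceed. While this block waits, the SM can issue warps from
        // *other* resident blocks -- which is why several smaller blocks
        // beat one huge one here.
        __syncthreads();

        // Toy follow-up work that consumes a neighbor's staged value.
        int next = (threadIdx.x + 1) % TILE;
        if (gid < n) out[gid] = tile[threadIdx.x] + tile[next];
    }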

You don't necessarily want really tiny blocks (since the SMs on Fermi support at most 8 resident blocks each), but blocks of 128-512 threads are often more efficient than blocks of 1024 threads.
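Continuing the sketch above, a launch configuration with 256-thread blocks might look like this (the problem size is illustrative, and the input is left uninitialized since the point is only the launch shape):

    #include <cuda_runtime.h>

    int main() {
        int n = 1 << 20;  // illustrative problem size
        float *d_in = nullptr, *d_out = nullptr;
        cudaMalloc(&d_in,  n * sizeof(float));
        cudaMalloc(&d_out, n * sizeof(float));

        int blockSize = 256;  // mid-sized block, per the advice above
        int gridSize  = (n + blockSize - 1) / blockSize;

        // Many mid-sized blocks give each SM several resident blocks to
        // schedule around the __syncthreads() stalls.
        stageAndReduce<<<gridSize, blockSize>>>(d_in, d_out, n);
        cudaDeviceSynchronize();

        cudaFree(d_in);
        cudaFree(d_out);
        return 0;
    }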

answered Sep 17 '22 by harrism