 

The peak throughput of a CUDA kernel on an NVIDIA GPU

I have a question about the throughput of a kernel running on a GPU. Assume its occupancy is 0.5 and its block size is 256. The programming guide states that it is better to have many blocks so they can hide memory latency, etc., but I don't understand why this is correct. Because as soon as the kernel has 24 warps per streaming multiprocessor, i.e., 3 blocks (24 warps × 32 threads/warp = 768 threads = 3 blocks of 256 threads), it will reach peak throughput. So having more than 24 warps (or 3 blocks) won't change the throughput at all.

Am I missing anything? Can anyone correct me?

asked Aug 06 '11 by Zk1001


1 Answer

While it is true that low-occupancy SMs cannot sufficiently hide latency, it is important to understand this:

Higher Occupancy != Higher Throughput!

Occupancy is simply a measure of how much work is available for the SM to choose from at any given instant. Having more resident warps gives the SM more ability to do useful work while other warps are waiting for results (results of memory accesses, or computations -- both have non-zero latency).

Throughput is a measure of how much work gets done per second, and while it can be limited by latency (and therefore occupancy), it also can be limited by memory bandwidth, instruction throughput (the number of execution units), and other factors.
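For concreteness, occupancy can be measured rather than guessed. Here is a minimal sketch, assuming a hypothetical kernel myKernel, that uses the CUDA runtime helper cudaOccupancyMaxActiveBlocksPerMultiprocessor (added in CUDA 6.5, so newer than the Fermi-era toolkits this question dates from) to compute the ratio of resident warps to the SM's maximum:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Hypothetical kernel, used here only to query occupancy for illustration.
    __global__ void myKernel(float *out) {
        out[blockIdx.x * blockDim.x + threadIdx.x] = 1.0f;
    }

    int main() {
        int blockSize = 256;

        // How many 256-thread blocks of myKernel fit on one SM, given the
        // kernel's register and shared-memory usage?
        int numBlocksPerSM = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocksPerSM,
                                                      myKernel, blockSize, 0);

        int maxThreadsPerSM = 0;
        cudaDeviceGetAttribute(&maxThreadsPerSM,
                               cudaDevAttrMaxThreadsPerMultiProcessor, 0);

        // Occupancy = resident warps / maximum resident warps per SM.
        float occupancy = (float)(numBlocksPerSM * blockSize / 32)
                        / (float)(maxThreadsPerSM / 32);
        printf("resident blocks/SM: %d, occupancy: %.2f\n",
               numBlocksPerSM, occupancy);
        return 0;
    }

Whatever this reports only tells you how much work the SM can choose from at once, not how fast that work will complete.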

The reason the programming guide states that it is better to have multiple thread blocks than just one large thread block is that sometimes the SM needs to be able to issue work not just from other warps, but from other blocks. Here's an example:

Imagine that your big thread block has to load data from global memory (high latency) and store it into shared memory (low latency), and then must immediately do a __syncthreads(). In this case, when a warp is finished loading its data and writing it to shared memory, it must then wait until all other threads in the block finish doing the same. For a large block, that can be quite a while. But if there are multiple smaller thread blocks occupying the SM, then the SM can switch and do work from the other blocks while waiting for the __syncthreads() to be satisfied in the first block. This can help reduce GPU idle time and improve efficiency.
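Here is a minimal sketch of that pattern (the kernel name, tile size, and follow-up computation are illustrative, not from the question): each block stages a tile from global memory into shared memory and then waits at a __syncthreads() barrier, which is exactly where having other resident blocks to schedule pays off:

    #include <cuda_runtime.h>

    #define TILE 256  // illustrative block size; launch with TILE threads per block

    __global__ void stageAndReduce(const float *in, float *out, int n) {
        __shared__ float tile[TILE];

        int gid = blockIdx.x * blockDim.x + threadIdx.x;

        // High-latency global load, staged into low-latency shared memory.
        tile[threadIdx.x] = (gid < n) ? in[gid] : 0.0f;

        // Every warp in this block must reach the barrier before any may
        // proceed. While this block waits, the SM can issue warps from
        // *other* resident blocks -- which is why several smaller blocks
        // beat one huge one here.
        __syncthreads();

        // Toy follow-up work that consumes a neighbor's staged value.
        int next = (threadIdx.x + 1) % TILE;
        if (gid < n) out[gid] = tile[threadIdx.x] + tile[next];
    }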

You don't necessarily want really tiny blocks (since the SMs on Fermi support at most 8 resident blocks each), but blocks of 128-512 threads are often more efficient than blocks of 1024 threads.
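Continuing the sketch above, a launch configuration with 256-thread blocks might look like this (the problem size is illustrative, and the input is left uninitialized since the point is only the launch shape):

    #include <cuda_runtime.h>

    int main() {
        int n = 1 << 20;  // illustrative problem size
        float *d_in = nullptr, *d_out = nullptr;
        cudaMalloc(&d_in,  n * sizeof(float));
        cudaMalloc(&d_out, n * sizeof(float));

        int blockSize = 256;  // mid-sized block, per the advice above
        int gridSize  = (n + blockSize - 1) / blockSize;

        // Many mid-sized blocks give each SM several resident blocks to
        // schedule around the __syncthreads() stalls.
        stageAndReduce<<<gridSize, blockSize>>>(d_in, d_out, n);
        cudaDeviceSynchronize();

        cudaFree(d_in);
        cudaFree(d_out);
        return 0;
    }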

answered Sep 17 '22 by harrism