I have a question about the throughput of a kernel running on a GPU. Assume its occupancy is 0.5 and its block size is 256. The programming guide states that it is better to have many blocks so that they can hide memory latency, etc., but I don't understand why this is correct. As soon as the kernel has 24 warps per Streaming Multiprocessor, i.e., 3 blocks, it will reach peak throughput, so having more than 24 warps (or 3 blocks) won't change the throughput at all.
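(My arithmetic, taking a Fermi-class device with a maximum of 48 resident warps per SM:)

```
warps per block = 256 threads / 32 threads per warp = 8
resident warps  = 0.5 occupancy × 48 max warps      = 24
resident blocks = 24 warps / 8 warps per block      = 3
```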
Am I missing anything? Can anyone correct me?
While it is true that low-occupancy SMs cannot sufficiently hide latency, it is important to understand this:
Higher Occupancy != Higher Throughput!
Occupancy is simply a measure of how much work is available for the SM to choose from at any given instant: the ratio of resident warps to the maximum number of warps the SM can hold. Having more resident warps gives the SM more opportunities to do useful work while other warps are waiting for results (results of memory accesses or of computations; both have non-zero latency).

Throughput is a measure of how much work gets done per second. While it can be limited by latency (and therefore by occupancy), it can also be limited by memory bandwidth, by instruction throughput (the number of execution units), and by other factors.
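As an illustration, consider a streaming copy: it is bound by memory bandwidth, so once enough warps are in flight to saturate DRAM, raising occupancy further does not raise throughput. (A minimal sketch; the kernel name is illustrative.)

```cuda
// Bandwidth-bound kernel: each thread performs one global load and one
// global store. Beyond the warp count needed to keep the memory system
// busy, additional resident warps (higher occupancy) add no throughput.
__global__ void copyKernel(const float* __restrict__ in,
                           float* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}
```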
The reason the programming guide states that it is better to have multiple thread blocks than just one large thread block is that sometimes it is useful to be able to issue work not just from other warps but also from other blocks. Here's an example:
Imagine that your big thread block has to load data from global memory (high latency) and store it into shared memory (low latency), and then must immediately do a __syncthreads(). In this case, when a warp has finished loading its data and writing it to shared memory, it must wait until all other threads in the block have done the same. For a large block, that can be quite a while. But if there are multiple smaller thread blocks occupying the SM, then the SM can switch to work from the other blocks while it waits for the __syncthreads() in the first block to be satisfied. This can help reduce GPU idle time and improve efficiency.
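A minimal sketch of that pattern (the kernel name and the post-sync work are illustrative; it assumes 256-thread blocks):

```cuda
__global__ void tiledKernel(const float* __restrict__ in,
                            float* __restrict__ out, int n)
{
    __shared__ float tile[256];                  // assumes blockDim.x == 256
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // High-latency global load, staged into low-latency shared memory.
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;

    // Every warp in the block waits here until the slowest warp arrives.
    // Meanwhile the SM can run warps from *other* resident blocks.
    __syncthreads();

    // Illustrative post-sync work: read a neighbor's element from the tile.
    if (i < n)
        out[i] = tile[blockDim.x - 1 - threadIdx.x];
}
```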
You don't necessarily want to have really tiny blocks (since the SMs on Fermi support at most 8 resident blocks), but having blocks of 128-512 threads is often more efficient than using blocks with 1024 threads.
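For instance, launching the sketch above with several 256-thread blocks rather than one 1024-thread block (d_in, d_out, and n are assumed to be set up already):

```cuda
// Several medium-sized blocks per SM let the scheduler interleave
// independent blocks while one block stalls at __syncthreads().
int threadsPerBlock = 256;  // 8 warps per block
int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;
tiledKernel<<<numBlocks, threadsPerBlock>>>(d_in, d_out, n);
```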