
CUDA core pipeline

I am following this article about a prediction model for GPUs. On page 5, second column, near the end, they state:

One has to finally take care of the fact that each of the Nc cores(SPs) in an SM on the GPU has a D-deep pipeline that has the effect of executing D threads in parallel.

My question is about the D-deep pipeline. What does this pipeline look like? Is it conceptually similar to a CPU pipeline (only in concept, since GPU and CPU architectures are completely different), with fetch, decode, execute, and write-back stages?

Is there any documentation where this is described?

asked May 22 '13 by BRabbit27




2 Answers

Yes, a GPU SM's pipeline looks a bit like a CPU's. The difference is in the frontend/backend proportions of the pipeline: the GPU has a single fetch/decode stage and many small ALUs (think of them as 32 parallel execute sub-pipelines), grouped as "CUDA cores" inside the SM. This is similar to a superscalar CPU (e.g. a Core i7 has 6-8 issue ports, one port per independent ALU pipeline).
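To connect this to the CUDA programming model, here is a minimal SIMT sketch (the kernel and launch parameters are purely illustrative, not from the paper): the SM fetches and decodes one instruction per warp, and the 32 lanes (CUDA cores) execute that same instruction on 32 different data elements in lockstep.

    // Illustrative kernel: one instruction stream, executed in lockstep by the
    // 32 threads of each warp; the SM's single frontend feeds many small ALUs.
    __global__ void axpy(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x; // each lane computes its own index
        if (i < n)
            y[i] = a * x[i] + y[i];                    // same instruction, 32 data lanes per warp
    }

    // Hypothetical launch: 128 threads per block = 4 warps per block.
    // axpy<<<(n + 127) / 128, 128>>>(n, 2.0f, d_x, d_y);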

Here is the GTX 460 SM (image from destructoid.com); we can even see what is inside each CUDA core, with its two pipelines: a dispatch port, then an operand collector, then two parallel units (one for INT and one for FP), and a result queue:

GTX 460 SM

(or a better-quality image, http://www.legitreviews.com/images/reviews/1193/sm.jpg, from http://www.legitreviews.com/article/1193/2/)

We can see that this SM has one instruction cache, two warp schedulers, four dispatch units, and a single register file. So the first stages of the GPU SM pipeline are a shared resource of the SM. After scheduling, instructions are dispatched to the CUDA cores, and each core may have its own multi-stage (pipelined) ALU, especially for complex operations.

The length of the pipeline is hidden inside the architecture, but I assume the total pipeline depth is much more than 4. (There are clearly instructions with a latency of 4 clock ticks, so the ALU pipeline is >= 4 stages, and the total SM pipeline depth is assumed to be more than 20 stages: https://devtalk.nvidia.com/default/topic/390366/instruction-latency/ )

There is some additional information about full instruction latencies: https://devtalk.nvidia.com/default/topic/419456/how-to-schedule-warps-/ - 24-28 clocks for single-precision (SP) and 48-52 clocks for double-precision (DP) operations.

Anandtech posted some pictures of an AMD GPU, and we can assume that the main ideas of pipelining are similar for both vendors: http://www.anandtech.com/show/4455/amds-graphics-core-next-preview-amd-architects-for-compute/4

AMD core according to Anandtech

So the fetch, decode, and branch units are shared by all SIMD cores, and there are a lot of ALU pipelines. In the AMD design the register file is segmented between groups of ALUs, while in the Nvidia diagram it is shown as a single unit (but it may be implemented as segmented and accessed via an interconnect network).

As stated in this work:

Fine-grained parallelism, however, is what sets GPUs apart. Recall that threads execute synchronously in bundles known as warps. GPUs run most efficiently when the number of warps-in-flight is large. Although only one warp can be serviced per cycle (Fermi technically services two half-warps per shader cycle), the SM's scheduler will immediately switch to another active warp when a hazard is encountered. If the instruction stream generated by the CUDA compiler expresses an ILP of 3.0 (that is, an average of three instructions can be executed before a hazard), and the instruction pipeline depth is 22 stages, as few as eight active warps (22 / 3) may be sufficient to completely hide instruction latency and achieve max arithmetic throughput. GPU latency hiding delivers good utilization of the GPU's vast execution resources with little burden on the programmer.

So only one warp at a time is dispatched each clock from the pipeline frontend (the SM scheduler), and there is some latency between the scheduler's dispatch and the time when the ALUs finish the calculation.
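As a quick sanity check of the arithmetic in the quoted passage, here is a small sketch that uses only the numbers given there (a 22-stage pipeline and an ILP of 3.0); it is a rule of thumb, not an NVIDIA formula:

    // Rule of thumb from the quote above: warps needed to hide latency
    // ~= pipeline depth / ILP. The inputs are the quoted values, not measurements.
    #include <cmath>
    #include <cstdio>

    int main()
    {
        const int pipeline_depth = 22;   // instruction pipeline stages, as quoted
        const double ilp = 3.0;          // independent instructions before a hazard
        int warps = static_cast<int>(std::ceil(pipeline_depth / ilp));
        std::printf("active warps needed: %d\n", warps);   // prints 8
        return 0;
    }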

Here is part of a picture from Real World Tech (http://www.realworldtech.com/cayman/5/ and http://www.realworldtech.com/cayman/11/) showing the Fermi pipeline. Note the [16] annotation on every ALU/FPU - it means there are physically 16 identical ALUs.

Fermi pipeline according to Real World Tech

answered Sep 24 '22 by osgx


Ordinary thread-level parallelism comes about in a GPU SM when multiple warps are available for execution. Hardware multithreading is described here.

The paper is fairly old and has the GTX 280 GPU in view. GPUs prior to the Fermi generation had an SM processing arrangement that looks a little different from the SM arrangement in Fermi and later GPUs. The high-level processing effect is the same -- 32 threads in a warp are executed in "lockstep" -- but whereas later SMs have at least 32 SPs (cores) per SM, GPUs prior to the Fermi generation had fewer cores per SM -- typically 8. The effect is that a given warp instruction is executed in stepwise fashion, with each "core" or "SP" in effect handling multiple lanes within the warp in order to process a particular warp instruction. I believe (based on what I see in the paper) that this is the "pipeline" being referred to. In effect, each "core" in a GTX 280 has a "4-deep pipeline" that handles 4 threads out of the warp and therefore requires 4 clocks (minimum) to complete processing of the 4 threads in the warp that are assigned to it. This is documented here, and you may wish to compare the description to that given for later GPU generations, such as the cc 2.0 description given here.
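A small worked sketch of that arithmetic (the GTX 280 figures, 32-thread warps and 8 SPs per SM, come from the description above; the code itself is only illustrative):

    // Pre-Fermi (GTX 280) case: a 32-thread warp issued to 8 SPs means each SP
    // steps through 4 of the warp's threads, i.e. the "4-deep pipeline" and a
    // minimum of 4 clocks per warp instruction.
    #include <cstdio>

    int main()
    {
        const int warp_size  = 32;  // threads per warp
        const int sps_per_sm = 8;   // SPs ("cores") per SM on GT200
        const int depth      = warp_size / sps_per_sm;   // 4 threads per SP
        std::printf("each SP handles %d threads -> >= %d clocks per warp instruction\n",
                    depth, depth);
        return 0;
    }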

And yes, for those who would argue with my usage of "cores" and "SP's", I agree that it's an inadequate description of how the compute resources in a GPU SM are actually laid out, but I believe this description is consistent with NVIDIA marketing and training literature, and consistent with the way the term "core" or "SP" is being used in the referenced paper.

answered Sep 22 '22 by Robert Crovella