
CUDA core pipeline

I am following this article about a prediction model for GPUs. On page 5, second column, near the end, they state:

One has to finally take care of the fact that each of the Nc cores(SPs) in an SM on the GPU has a D-deep pipeline that has the effect of executing D threads in parallel.

My question is about the D-deep pipeline. What does this pipeline look like? Is it conceptually similar to a CPU pipeline (only in concept, since GPU and CPU architectures are completely different), with fetch, decode, execute, and write-back stages?

Is there any documentation where this is described?

asked May 22 '13 by BRabbit27




2 Answers

Yes, a GPU SM's pipeline looks a bit like a CPU's. The difference is in the frontend/backend proportions of the pipeline: the GPU has a single fetch/decode stage and many small ALUs (think of them as 32 parallel execute sub-pipelines), grouped as "CUDA cores" inside the SM. This is similar to a superscalar CPU (e.g. a Core i7 has 6-8 issue ports, one port per independent ALU pipeline).
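To connect this to the CUDA programming model, here is a minimal SIMT sketch (the kernel and launch parameters are purely illustrative, not from the paper): the SM fetches and decodes one instruction per warp, and the 32 lanes (CUDA cores) execute that same instruction on 32 different data elements in lockstep.

    // Illustrative kernel: one instruction stream, executed in lockstep by the
    // 32 threads of each warp; the SM's single frontend feeds many small ALUs.
    __global__ void axpy(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x; // each lane computes its own index
        if (i < n)
            y[i] = a * x[i] + y[i];                    // same instruction, 32 data lanes per warp
    }

    // Hypothetical launch: 128 threads per block = 4 warps per block.
    // axpy<<<(n + 127) / 128, 128>>>(n, 2.0f, d_x, d_y);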

Here is the GTX 460 SM (image from destructoid.com); we can even see what is inside each CUDA core, with its two pipelines: a dispatch port, then an operand collector, then two parallel units (one for INT and one for FP), and a result queue:

GTX 460 SM

(or a better-quality image, http://www.legitreviews.com/images/reviews/1193/sm.jpg, from http://www.legitreviews.com/article/1193/2/)

We can see that this SM has one instruction cache, two warp schedulers, four dispatch units, and a single register file. So the first stages of the GPU SM pipeline are a shared resource of the SM. After scheduling, instructions are dispatched to the CUDA cores, and each core may have its own multi-stage (pipelined) ALU, especially for complex operations.

The length of the pipeline is hidden inside the architecture, but I assume the total pipeline depth is much more than 4. (There are clearly instructions with a latency of 4 clock ticks, so the ALU pipeline is >= 4 stages, and the total SM pipeline depth is assumed to be more than 20 stages: https://devtalk.nvidia.com/default/topic/390366/instruction-latency/ )

There is some additional information about full instruction latencies: https://devtalk.nvidia.com/default/topic/419456/how-to-schedule-warps-/ - 24-28 clocks for single-precision (SP) and 48-52 clocks for double-precision (DP) operations.

Anandtech posted some pictures of an AMD GPU, and we can assume that the main ideas of pipelining are similar for both vendors: http://www.anandtech.com/show/4455/amds-graphics-core-next-preview-amd-architects-for-compute/4

AMD core according to Anandtech

So the fetch, decode, and branch units are shared by all SIMD cores, and there are a lot of ALU pipelines. In the AMD design the register file is segmented between groups of ALUs, while in the Nvidia diagram it is shown as a single unit (but it may be implemented as segmented and accessed via an interconnect network).

As stated in this work:

Fine-grained parallelism, however, is what sets GPUs apart. Recall that threads execute synchronously in bundles known as warps. GPUs run most efficiently when the number of warps-in-flight is large. Although only one warp can be serviced per cycle (Fermi technically services two half-warps per shader cycle), the SM's scheduler will immediately switch to another active warp when a hazard is encountered. If the instruction stream generated by the CUDA compiler expresses an ILP of 3.0 (that is, an average of three instructions can be executed before a hazard), and the instruction pipeline depth is 22 stages, as few as eight active warps (22 / 3) may be sufficient to completely hide instruction latency and achieve max arithmetic throughput. GPU latency hiding delivers good utilization of the GPU's vast execution resources with little burden on the programmer.

So only one warp at a time is dispatched each clock from the pipeline frontend (the SM scheduler), and there is some latency between the scheduler's dispatch and the time when the ALUs finish the calculation.
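As a quick sanity check of the arithmetic in the quoted passage, here is a small sketch that uses only the numbers given there (a 22-stage pipeline and an ILP of 3.0); it is a rule of thumb, not an NVIDIA formula:

    // Rule of thumb from the quote above: warps needed to hide latency
    // ~= pipeline depth / ILP. The inputs are the quoted values, not measurements.
    #include <cmath>
    #include <cstdio>

    int main()
    {
        const int pipeline_depth = 22;   // instruction pipeline stages, as quoted
        const double ilp = 3.0;          // independent instructions before a hazard
        int warps = static_cast<int>(std::ceil(pipeline_depth / ilp));
        std::printf("active warps needed: %d\n", warps);   // prints 8
        return 0;
    }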

Here is part of a picture from Real World Tech (http://www.realworldtech.com/cayman/5/ and http://www.realworldtech.com/cayman/11/) showing the Fermi pipeline. Note the [16] annotation on every ALU/FPU - it means there are physically 16 identical ALUs.

Fermi pipeline according to Real World Tech

answered Sep 24 '22 by osgx


Ordinary thread-level parallelism comes about in a GPU SM when multiple warps are available for execution. Hardware multithreading is described here.

The paper is fairly old and has the GTX 280 GPU in view. GPUs prior to the Fermi generation had an SM processing arrangement that looks a little different from the SM arrangement in Fermi and later GPUs. The high-level processing effect is the same -- 32 threads in a warp are executed in "lockstep" -- but whereas later SMs have at least 32 SPs (cores) per SM, GPUs prior to the Fermi generation had fewer cores per SM -- typically 8. The effect is that a given warp instruction is executed in stepwise fashion, with each "core" or "SP" in effect handling multiple lanes within the warp in order to process a particular warp instruction. I believe (based on what I see in the paper) that this is the "pipeline" being referred to. In effect, each "core" in a GTX 280 has a "4-deep pipeline" that handles 4 threads out of the warp and therefore requires 4 clocks (minimum) to complete processing of the 4 threads in the warp that are assigned to it. This is documented here, and you may wish to compare the description to that given for later GPU generations, such as the cc 2.0 description given here.
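A small worked sketch of that arithmetic (the GTX 280 figures, 32-thread warps and 8 SPs per SM, come from the description above; the code itself is only illustrative):

    // Pre-Fermi (GTX 280) case: a 32-thread warp issued to 8 SPs means each SP
    // steps through 4 of the warp's threads, i.e. the "4-deep pipeline" and a
    // minimum of 4 clocks per warp instruction.
    #include <cstdio>

    int main()
    {
        const int warp_size  = 32;  // threads per warp
        const int sps_per_sm = 8;   // SPs ("cores") per SM on GT200
        const int depth      = warp_size / sps_per_sm;   // 4 threads per SP
        std::printf("each SP handles %d threads -> >= %d clocks per warp instruction\n",
                    depth, depth);
        return 0;
    }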

And yes, for those who would argue with my usage of "cores" and "SP's", I agree that it's an inadequate description of how the compute resources in a GPU SM are actually laid out, but I believe this description is consistent with NVIDIA marketing and training literature, and consistent with the way the term "core" or "SP" is being used in the referenced paper.

answered Sep 22 '22 by Robert Crovella