This may look like too simple a question to ask, but I'm asking after going through a few presentations on both. Both methods increase instruction throughput, and superscaling almost always makes use of pipelining as well. Superscaling has more than one execution unit, and so does pipelining, or am I wrong here?


What is the difference between superscaling and pipelining?

2 Answers

Superscalar design involves the processor being able to issue multiple instructions in a single clock, with redundant facilities to execute an instruction. We're talking about within a single core, mind you -- multicore processing is different.

Pipelining divides an instruction into steps, and since each step is executed in a different part of the processor, multiple instructions can be in different "phases" each clock.

They're almost always used together. This image from Wikipedia shows both concepts in use, as these concepts are best explained graphically:

Here, two instructions are being executed at a time in a five-stage pipeline.

To break it down further, given your recent edit:

In the example above, an instruction goes through 5 stages to be "performed". These are IF (instruction fetch), ID (instruction decode), EX (execute), MEM (memory access), WB (write-back of the result to the register file).

In a very simple processor design, one stage would be completed every clock, so for a single instruction A we'd have:

IF(A)
ID(A)
EX(A)
MEM(A)
WB(A)

That is one instruction in five clocks. If we then add a redundant execution unit and introduce superscalar design, we'd have this, for two instructions A and B:

IF(A) IF(B)
ID(A) ID(B)
EX(A) EX(B)
MEM(A) MEM(B)
WB(A) WB(B)

Two instructions in five clocks -- a theoretical maximum gain of 100%.

Pipelining allows the parts to be executed simultaneously, so we would end up with something like (for ten instructions A through J):

IF(A) IF(B)
ID(A) ID(B) IF(C) IF(D)
EX(A) EX(B) ID(C) ID(D) IF(E) IF(F)
MEM(A) MEM(B) EX(C) EX(D) ID(E) ID(F) IF(G) IF(H)
WB(A) WB(B) MEM(C) MEM(D) EX(E) EX(F) ID(G) ID(H) IF(I) IF(J)
WB(C) WB(D) MEM(E) MEM(F) EX(G) EX(H) ID(I) ID(J)
WB(E) WB(F) MEM(G) MEM(H) EX(I) EX(J)
WB(G) WB(H) MEM(I) MEM(J)
WB(I) WB(J)

In nine clocks, we've executed ten instructions, where the unpipelined two-wide design above would need 25 clocks (five pairs at five clocks each) -- you can see where pipelining really moves things along. And that is an explanation of the example graphic, not how it's actually implemented in the field (that's black magic).
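To make the table above reproducible, here is a minimal sketch that generates the same clock-by-clock listing. It assumes an idealized machine: every stage takes exactly one clock, instructions issue in order in groups of `width`, and nothing ever stalls. The function name `schedule` and its parameters are made up for this illustration, and the letter naming limits it to 26 instructions.

```python
import string

STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def schedule(n_instructions, width=2, stages=STAGES):
    """Return one list of "STAGE(instr)" labels per clock for an idealized
    in-order pipeline that starts `width` instructions per clock."""
    names = string.ascii_uppercase[:n_instructions]
    n_groups = -(-n_instructions // width)  # ceiling division
    n_clocks = n_groups + len(stages) - 1   # fill time plus one clock per group
    rows = []
    for clock in range(n_clocks):
        row = []
        for i, name in enumerate(names):
            # Instruction i enters the first stage on clock i // width.
            stage_index = clock - i // width
            if 0 <= stage_index < len(stages):
                row.append(f"{stages[stage_index]}({name})")
        rows.append(row)
    return rows

rows = schedule(10)
for row in rows:
    print(" ".join(row))
print(len(rows), "clocks")
```

Calling `schedule(10, width=1)` gives the plain (scalar) pipelined case, which takes 14 clocks under the same assumptions.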

The Wikipedia articles for Superscalar and Instruction pipeline are pretty good.

answered Oct 14 '22 05:10

Jed Smith

A long time ago, CPUs executed only one machine instruction at a time. Only when it was completely finished did the CPU fetch the next instruction from memory (or, later, the instruction cache).

Eventually, someone noticed that this meant that most of a CPU did nothing most of the time, since there were several execution subunits (such as the instruction decoder, the integer arithmetic unit, the FP arithmetic unit, etc.) and executing an instruction kept only one of them busy at a time.

Thus, "simple" pipelining was born: once one instruction was done decoding and went on towards the next execution subunit, why not already fetch and decode the next instruction? If you had 10 such "stages", then by having each stage process a different instruction you could theoretically increase the instruction throughput tenfold without increasing the CPU clock at all! Of course, this only works flawlessly when there are no conditional jumps in the code (this led to a lot of extra effort to handle conditional jumps specially).
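The "tenfold" figure is a limit rather than something you get on a short run, because the pipeline first has to fill. A rough sketch of the arithmetic, assuming one clock per stage and no stalls or jumps (the function names are invented for this example):

```python
def pipeline_cycles(n_instructions, n_stages):
    """Clocks needed by an ideal pipeline: n_stages to fill it for the
    first instruction, then one more clock per remaining instruction."""
    return n_stages + n_instructions - 1

def pipeline_speedup(n_instructions, n_stages):
    """Ratio of unpipelined clocks (every instruction takes n_stages
    clocks on its own) to pipelined clocks."""
    unpipelined = n_instructions * n_stages
    return unpipelined / pipeline_cycles(n_instructions, n_stages)

for n in (1, 10, 1000):
    print(n, "instructions:", round(pipeline_speedup(n, 10), 2), "x")
```

With a single instruction there is no gain at all, and the speedup only approaches 10 as the instruction stream gets long.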

Later, with Moore's law continuing to be correct for longer than expected, CPU makers found themselves with ever more transistors to make use of and thought "why have only one of each execution subunit?". Thus, superscalar CPUs with multiple execution subunits able to do the same thing in parallel were born, and CPU designs became much, much more complex, because they had to distribute instructions across these fully parallel units while ensuring the results were the same as if the instructions had been executed sequentially.
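To give a feel for the "same results as sequential execution" constraint, here is a toy sketch of in-order dual issue. It is not how any real CPU does it: the `Instr` record, the register names and `issue_groups` are all hypothetical, and the only rule modelled is that an instruction may not issue in the same clock as an earlier one whose destination register it reads or overwrites.

```python
from collections import namedtuple

# Hypothetical instruction record: a name, one destination register,
# and the registers it reads.
Instr = namedtuple("Instr", "name dest srcs")

def issue_groups(program, width=2):
    """Greedily pack in-order instructions into groups of at most `width`,
    starting a new group when an instruction reads or overwrites a
    register written earlier in the current group."""
    groups, current, written = [], [], set()
    for ins in program:
        depends = bool(written & set(ins.srcs)) or ins.dest in written
        if current and (len(current) == width or depends):
            groups.append(current)
            current, written = [], set()
        current.append(ins.name)
        written.add(ins.dest)
    if current:
        groups.append(current)
    return groups

program = [
    Instr("A", "r1", ("r2", "r3")),  # r1 = r2 + r3
    Instr("B", "r4", ("r1", "r5")),  # r4 = r1 + r5, needs A's result
    Instr("C", "r6", ("r7", "r8")),  # independent of B
    Instr("D", "r9", ("r2", "r3")),  # independent, but B and C fill the group
]
print(issue_groups(program))  # [['A'], ['B', 'C'], ['D']]
```

B cannot pair with A because it needs `r1`, so A issues alone, while B and C can go together. Real designs add register renaming, out-of-order scheduling and much more on top of this kind of check.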

answered Oct 14 '22 05:10

Michael Borgwardt


Tags:

processor

pipelining

Alex Xander
