
Can an x86_64 CPU execute two identical operations at the same pipeline stage?

As is known, Intel x86_64 processors are not only pipelined, but also superscalar.

This means that the CPU can:

  1. Pipeline - In one clock cycle, execute different stages of successive operations. For example, two ADDs overlapped with staggered stages:

    • ADD(stage1) -> ADD(stage2) -> nothing
    • nothing -> ADD(stage1) -> ADD(stage2)
  2. Superscalar - In one clock cycle, execute several different operations. For example, an ADD and a MUL in parallel, in the same stages:

    • ADD(stage1) -> ADD(stage2)
    • MUL(stage1) -> MUL(stage2)


This is possible because the processor has several instruction decoders (Intel Core has 4 simple decoders).

But are only the decoders duplicated (4 simple decoders), or are the arithmetic units duplicated as well?

I.e. can we execute, for example, two ADDs in the same stages, but on independent arithmetic units (for example, the ALU on port 0 and the ALU on port 1) of the same CPU core?

  • ADD1(stage1) -> ADD1(stage2)
  • ADD2(stage1) -> ADD2(stage2)

Are any execution units duplicated, making it possible to execute two identical instructions in the same clock cycle?

Alex asked Oct 20 '22

1 Answer

Yes. The question already contained the answer, as the comments explained. :P

(just posting an answer to get this out of the unanswered questions list.)

I will add that Sandybridge and later Intel CPUs, with their uop cache, can more often come close to sustaining 4 uops per cycle in loops than previous CPUs (if the frontend is the bottleneck, rather than data dependencies (latency) or execution-port contention (throughput).) This is esp. helpful with longer-to-encode vector instructions, because the decoders can only handle 16B / cycle, which is often less than 4 uops.

See http://agner.org/optimize/, esp. the microarch doc, for details about instruction throughput out of the uop cache, and how uop cache-line boundaries can interfere with delivering the constant 4 uops per cycle the pipeline can handle. Small loops that fit in the loop buffer don't suffer from this potential bottleneck.

In reply to one of the comments: micro-fusion doesn't let you get more than 4 instructions per cycle to run. Only macro-fusion combines multiple instructions into a single uop. (micro-fusion does make it cheaper to use instructions with memory operands, but apparently only works with one-register addressing modes. This does increase IPC, and could bring the average above 4.)
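The one-register addressing-mode caveat can be illustrated with a hypothetical C sketch (my example, not from the answer): iterating with a plain pointer gives the compiler the chance to emit a memory-source add like `add rax, [rsi]`, which can stay micro-fused as a single uop, whereas an indexed form like `[rsi + rdi*8]` reportedly defeats micro-fusion on some microarchitectures.

```c
#include <stddef.h>

/* Summing through a bare pointer: the load+add can compile to one
 * memory-source ADD with a one-register addressing mode ([rsi]),
 * i.e. a single micro-fused uop, instead of a separate load uop
 * plus an add uop. Whether the compiler actually emits that form
 * depends on the compiler and optimization level. */
long sum_by_pointer(const long *p, size_t n) {
    long acc = 0;
    const long *end = p + n;
    while (p < end)
        acc += *p++;   /* memory-source add, one-register address */
    return acc;
}
```

An index-based loop (`acc += a[i]`) may instead produce the two-register addressing mode, so checking the generated asm is the only way to be sure micro-fusion applies.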

Peter Cordes answered Oct 22 '22