I wrote a simple program that executes a bunch of NOP instructions in a loop, and to my surprise it executes about 10,600,000,000 of them per second, or about 10.6 GHz, while my CPU is only 2.2 GHz.
How is this possible? Is the CPU treating them as a single mega-NOP, or did I just discover what "instruction-level parallelism" means?
What would be a better measure of instructions per second? Using ADD instructions I reach only 414,900,000/s, about a tenth of the BogoMIPS reported for my CPU: 4390.03.
C code:
#include <stdio.h>
#include <stdint.h>
#include <time.h>

#define ten(a) a a a a a a a a a a
#define hundred(a) ten(a) ten(a) ten(a) ten(a) ten(a) ten(a) ten(a) \
                   ten(a) ten(a) ten(a)
#define ITER 10000000

int main(void) {
    uint64_t i = 0;
    uint64_t t = time(NULL);
    while (1) {
        for (int j = 0; j < ITER; j++) {
            hundred(asm volatile ("nop");)
        }
        i += ITER * 100;
        printf("%lu/%lu\n", i, time(NULL) - t);
    }
    return 0;
}
Compiled assembly:
.file "gbloopinc.c"
.section .rodata
.LC0:
.string "%lu/%lu\n"
.text
.globl main
.type main, @function
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
subq $32, %rsp
movq $0, -16(%rbp)
movl $0, %edi
call time
movq %rax, -8(%rbp)
.L4:
movl $0, -20(%rbp)
jmp .L2
.L3:
#APP
# 15 "gbloopinc.c" 1
	nop
# 0 "" 2
# 15 "gbloopinc.c" 1
	nop
# 0 "" 2
#   ... the same four-line block repeats identically for all 100 NOPs ...
# 15 "gbloopinc.c" 1
	nop
# 0 "" 2
#NO_APP
addl $1, -20(%rbp)
.L2:
cmpl $9999999, -20(%rbp)
jle .L3
addq $1000000000, -16(%rbp)
movl $0, %edi
call time
subq -8(%rbp), %rax
movq %rax, %rdx
movq -16(%rbp), %rax
movq %rax, %rsi
movl $.LC0, %edi
movl $0, %eax
call printf
jmp .L4
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Ubuntu 5.4.0-6ubuntu1~16.04.2) 5.4.0 20160609"
.section .note.GNU-stack,"",@progbits
Only a simple, non-pipelined CPU is limited to carrying out one instruction at a time, and even a NOP still consumes clock cycles there. In a classic multi-cycle five-stage design without pipelining, a new instruction is fetched in stage 1 only after the previous instruction finishes stage 5, so each instruction takes five clock cycles (CPI = 5 > 1). Modern CPUs don't work like that model, which is why you can observe more than one instruction per cycle.
This has nothing to do with multiple cores. Cores are not "ports".
4 NOPs per clock is the issue/retirement pipeline width of your superscalar / out-of-order CPU. NOPs don't even need an execution unit / execution port (ALU or load or store), so you're not even limited by the number of integer execution units. Even Core2 (Intel's first 4-wide x86 CPU) could run 4 NOPs per clock.
As you guessed, this is an example of Instruction-level Parallelism. NOPs of course have no input dependencies.
On your Sandybridge CPU (with 3 ALU execution units per core), you could run 3 ADD and one load or store instruction per clock, since its pipeline width is 4 uops. See Agner Fog's microarch pdf and other links in the x86 tag wiki. On a stream of independent ADD instructions, like
add eax, eax
add ebx, ebx
add ecx, ecx
add edx, edx
...
you'd see about 3 per clock throughput on SnB, bottlenecking on integer ALU execution ports. Haswell could run this at 4 ADDs per clock, because it has a 4th ALU execution port that can handle non-vector integer ops (and branches).
Out-of-order CPUs typically have a wider front-end and issue/retire width than the number of execution units. Having more instructions decoded and ready to execute as soon as there's a free execution unit increases their utilization; otherwise the out-of-order machinery could only see ahead of what's currently executing when execution stalls or slows down due to serial dependencies. (e.g. add eax,eax / add eax,eax needs the output of the first add as the input to the second add, so it can only run at one insn per clock.)
I'll expand more on HansPassant's comment.
Modern processors are both superscalar and multicore. It is easy to understand what a multicore processor is: it has multiple cores. Superscalar, on the other hand, requires a bit more knowledge about the hardware. There is a Stack Exchange question that explains what it means for a processor to be superscalar. Superscalar processors have many functional units in the same core and are heavily pipelined, which is why multiple instructions can be dispatched and in flight at the same time in a single core. Here are some of the functional units in a processor: integer addition/subtraction, floating-point multiplication, floating-point division, integer multiplication, integer division.
I encourage you to Google more about superscalar processors and look up more info about your processor in particular.