When learning assembly I realized that I should put frequently accessed data in registers instead of memory, because memory is much slower.
The question is, how can the CPU run faster than memory if the instructions are fetched from memory in the first place? Does the CPU usually spend a lot of time waiting for instructions from memory?
EDIT: To run a program, we compile it to a file containing machine code. Then we load that file into memory and run one instruction after another. The CPU needs to know which instruction to run next, and that piece of information is fetched from memory. I'm not asking about manipulating data but about the process of reading the instructions from memory. Sorry if I wasn't clear enough.
EDIT 2: For example, xor eax, eax compiles to 31c0 on my computer. I know this instruction itself is fast. But to clear eax, the CPU first needs to read 31c0 from memory. Shouldn't that read take a long time if accessing memory is slow, and does the CPU just stall for that period?
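To make that concrete, here is a small sketch (NASM syntax; the immediate values are arbitrary, and the byte counts come from the standard x86 encodings). Every one of these bytes has to reach the CPU before the corresponding instruction can execute:

    ; a few instructions and the machine-code bytes the CPU must fetch
    ; before it can execute them (valid in 32- and 64-bit mode)
    xor  eax, eax           ; 31 C0            -> 2 bytes of code
    mov  ecx, 0x12345678    ; B9 78 56 34 12   -> 5 bytes of code
    add  eax, ecx           ; 01 C8            -> 2 bytes of code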
Code fetch in parallel with instruction execution is so critical that even 8086 did it (to a limited extent, with a very small prefetch buffer and low bandwidth). Even so, code fetch bandwidth actually was THE major bottleneck for 8086.
(I just realized you didn't tag this x86, although you did use an x86 instruction as an example. All my examples are x86, but the basics are pretty much the same for any other architecture. Except that non-x86 CPUs won't use a decoded-uop cache, x86 is the only ISA still in common use that's so hard to decode that it's worth caching the decode results.)
In modern CPUs, code-fetch is rarely a bottleneck because caches and prefetching hide the latency, and bandwidth requirements are usually low compared to the bandwidth required for data. (Bloated code with a very large code footprint can run into slowdowns from instruction-cache misses, though, leading to stalls in the front-end.)
L1I cache is separate from L1D cache, and CPUs fetch/decode a block of at least 16 bytes of x86 code per cycle. CPUs with a decoded-uop cache (Intel Sandybridge family, and AMD Ryzen) even cache already-decoded instructions to remove decode bottlenecks.
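As a rough illustration (NASM syntax; the iteration count is arbitrary), a hot loop this small fits entirely in L1I, and in the decoded-uop cache on Sandybridge-family / Ryzen, so its bytes only have to come from memory the first time through:

        mov  ecx, 100000000     ; arbitrary iteration count
    loop_top:
        xor  eax, eax           ; 31 C0
        add  eax, ecx           ; 01 C8
        dec  ecx
        jnz  loop_top           ; the loop body is only a handful of bytes,
                                ; fetched from DRAM at most once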
See http://www.realworldtech.com/sandy-bridge/3/ for a fairly detailed write-up of the front-end in Intel Sandybridge (fetch/pre-decode/decode/rename+issue), with block diagrams comparing the instruction-fetch logic of Intel Sandybridge, Intel Nehalem, and AMD Bulldozer. (Decode is on the next page.) The "pre-decode" stage finds instruction boundaries (i.e. it determines instruction lengths ahead of decoding what each instruction actually is).
L1I cache misses result in a request to the unified L2. Modern x86 CPUs also have an L3 cache shared between multiple cores.
Hardware prefetching brings soon-to-be-needed code into L2 and L1I, just like data prefetching into L2 and L1D. This hides the > 200 cycle latency to DRAM most of the time, usually only failing on jumps to "cold" functions. It can almost always stay ahead of decode/execute when running a long sequence of code with no taken branches, unless something else (like data loads/stores) is using up all the memory bandwidth.
You could construct code that decodes at 16 bytes per cycle, which at multi-GHz clock speeds is more code-fetch bandwidth than main memory can usually supply. Or maybe even higher on an AMD CPU. But usually decode bottlenecks will limit you before pure code-fetch bandwidth does.
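A hedged back-of-the-envelope sketch of that arithmetic (NASM syntax; the clock speed and bandwidth figures are ballpark assumptions, not measurements):

    ; each instruction below is 5 bytes (opcode + imm32), so sustaining
    ; 16 bytes of decode per cycle on a ~4 GHz core means ~64 GB/s of
    ; code fetch -- more than DRAM typically delivers to one core, so
    ; that rate is only sustainable out of the caches
    add  eax, 0x11111111    ; 05 11 11 11 11
    mov  ecx, 0x22222222    ; B9 22 22 22 22
    mov  edx, 0x33333333    ; BA 33 33 33 33
    add  eax, 0x44444444    ; 05 44 44 44 44
    ; ...imagine this unrolled over many KiB of straight-line code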
See also Agner Fog's microarch guide for more about the front-end in various microarchitectures, and optimizing asm for them.
See also other CPU performance links in the x86 tag wiki.
If you have frequently accessed data, chances are that you also have the same instructions repeatedly processing it. An efficient CPU will not fetch the same instructions again and again from slow memory. Instead, they are kept in an instruction cache, which has very low access time. Therefore, the CPU doesn't generally need to wait for instructions.