 

How long does each machine language instruction take to execute? [duplicate]

Do operations like set, read, move and compare all take the same time to execute?

If not, is there any way to find out how long each one takes?

Is there a name for what I mean: the speed at which a specific type of CPU executes the different assembly language instructions (move, read, etc.)?

xrDDDD asked Jan 21 '12 00:01




2 Answers

The key terms you're probably looking for are:

  • Instruction Latency
  • Instruction Throughput

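These should be easy to search for. Basically, each instruction takes a certain number of cycles to produce its result (latency), but the processor can often have several instructions in flight at once, so the rate at which instructions complete (throughput) can be much higher than the latency alone suggests.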
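
To make the difference between the two terms concrete, here is a minimal sketch of a toy cycle model. The function names and the numbers (a 3-cycle operation, one issued per cycle) are assumptions chosen for illustration, not measurements of any real processor.

```python
import math

def chain_cycles(n, latency):
    """Cycles for n instructions where each one needs the previous result."""
    # A dependent chain cannot overlap: every instruction waits the full latency.
    return n * latency

def independent_cycles(n, latency, per_cycle):
    """Cycles for n independent instructions on a pipelined unit.

    per_cycle is the throughput: how many such instructions can start each cycle.
    """
    if n == 0:
        return 0
    # Cycles spent issuing everything before the last group starts,
    # plus the latency of that last group.
    issue_cycles = math.ceil(n / per_cycle) - 1
    return issue_cycles + latency

# Hypothetical numbers: a 3-cycle operation, one issued per cycle.
print(chain_cycles(100, 3))           # limited by latency
print(independent_cycles(100, 3, 1))  # limited by throughput
```

In this model the same 100 operations cost roughly three times as many cycles when each depends on the previous one, which is why both figures are published per instruction.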

Do operations like set, read, move and compare all take the same time to execute?

In general, no. Different instructions have different latencies and throughputs. For example, an addition is typically much faster than a division.


If you're interested in the actual values of different assembly instructions on modern processors, you can take a look at Agner Fog's tables.


That said, there are about a gazillion other factors that affect the performance of a computer, most of which are arguably more important than instruction latencies and throughputs:

  • Cache
  • Memory
  • Disk
  • Bloat (this seems to be a big one... :D)
  • etc... the list goes on and on...
Mysticial answered Oct 04 '22 08:10


Pipelining, caches, and the fact that the CPU itself is no longer the primary bottleneck have done two things to your question. First, CPUs today generally execute one instruction per clock. Second, it can take many (dozens to hundreds of) clocks to feed the CPU an instruction. More modern processors, even those with old instruction sets, rarely bother to document clocks per instruction, because the nominal figure is one clock and the "real" execution speed is too hard to describe.

The cache and pipeline try to let the CPU run at this one-instruction-per-clock rate, but a read from memory, for example, has to wait for the response to come back. If the item is not in the cache, this can take hundreds of clock cycles, because the memory system has to read a number of locations to fill a cache line and then spend more clocks getting the data through the caches back to the processor.
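
As a rough illustration of how much a cache miss can dominate, here is a back-of-the-envelope sketch. The function name and every number in it (memory reads per instruction, miss rate, miss penalty) are assumptions picked for the example, not figures for any particular processor.

```python
def effective_cpi(base_cpi, reads_per_instr, miss_rate, miss_penalty):
    """Average clocks per instruction once cache misses are included."""
    # Each instruction pays the base cost, plus the miss penalty
    # for the fraction of its memory reads that miss the cache.
    return base_cpi + reads_per_instr * miss_rate * miss_penalty

# Hypothetical workload: 1 clock per instruction when everything hits,
# 0.3 memory reads per instruction, 2% of reads miss, 200 clocks per miss.
print(effective_cpi(1.0, 0.3, 0.02, 200))
```

Even with only 2% of reads missing, the stalls in this made-up workload cost more clocks than the instructions themselves.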

Now, if you go back in time, or stay in the present but look at the microcontroller world or some other system where the memory can respond in one clock, or at least in a very deterministic number of clocks (say two clocks for EEPROM and one for RAM, that kind of thing), then you can very easily count the exact number of clocks. Processors like these often do publish a table of cycles per instruction. A read instruction that takes two fetches, for example, would need two clocks to fetch the instruction and then another clock to perform the read: 3 clocks minimum. Some instructions take more than one clock to execute, so that would be added in as well.
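
Counting clocks from such a table is simple bookkeeping. The sketch below assumes an imaginary microcontroller: the mnemonics, the cycle table, and the `fetch_clocks_per_word` parameter are all hypothetical and only mirror the kind of arithmetic described above.

```python
# Hypothetical cycle table for an imaginary microcontroller:
# mnemonic -> (instruction words to fetch, extra clocks to execute)
CYCLE_TABLE = {
    "nop": (1, 0),
    "add": (1, 0),
    "load": (2, 1),  # two fetches, then one clock to perform the read
    "mul": (1, 2),   # executes for more than one clock
}

def count_cycles(program, fetch_clocks_per_word=1):
    """Sum the clocks for a list of mnemonics on a deterministic memory system."""
    total = 0
    for mnemonic in program:
        words, execute_clocks = CYCLE_TABLE[mnemonic]
        total += words * fetch_clocks_per_word + execute_clocks
    return total

print(count_cycles(["load"]))                       # fetched from one-clock memory
print(count_cycles(["load", "add", "mul", "nop"]))  # a short sequence
print(count_cycles(["load"], fetch_clocks_per_word=2))  # slower program memory
```

This only works because every fetch and read takes a fixed number of clocks. Once a cache sits in the path, the table entries stop being constants.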

I highly recommend finding a (used) copy of Zen of Assembly Language by Michael Abrash. It was dated when it came out, but it is still an important work. Learning to juggle the relatively simple 8088/86 was tough enough; today's x86 and other systems are quite a bit more complicated.

If you are running Windows or Linux or something like that, trying to time your code won't necessarily get you where you want to go. As a simple example of how complicated the problem is: adding or removing a nop, which shifts the alignment of the code in memory by as little as a byte, can have dramatic effects on the performance of the rest of the code, which has not changed other than its location in RAM.
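
One way to see how noisy timing is on a general-purpose operating system is to time the same code several times and compare the runs. The sketch below uses Python's standard `timeit` module; the `work` and `measure` functions are hypothetical stand-ins, and this measures a whole interpreted loop, not individual machine instructions.

```python
import timeit

def work():
    """A small, fixed amount of computation to time."""
    total = 0
    for i in range(1000):
        total += i * i
    return total

def measure(func, repeats=5, number=200):
    """Return the average seconds per call for each of several repeated runs."""
    runs = timeit.repeat(func, repeat=repeats, number=number)
    return [t / number for t in runs]

samples = measure(work)
print("best :", min(samples))
print("worst:", max(samples))
```

The code being timed is identical in every run, so any spread between the best and worst figures comes from the rest of the system (caches, scheduling, other processes), which is the kind of interference that makes instruction-level timing hard there.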

What processor or system are you interested in? The STM32F4 Discovery board, about $20, contains an ARM (Cortex-M) processor with instruction and data caches. It has the complications of a bigger system, but it is simple enough (relative to a bigger system) to allow controlled experiments.

If you are familiar with the Microchip PIC world, programmers there often count cycles to produce precise delays between events. It is a very deterministic environment (so long as you don't use interrupts).
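
The arithmetic behind such a cycle-counted delay can be sketched as follows. The defaults are assumptions: four oscillator clocks per instruction cycle (typical of classic PIC parts, but check the datasheet for a given device) and a hypothetical delay loop that costs three instruction cycles per iteration.

```python
def delay_iterations(delay_us, fosc_hz, cycles_per_iteration=3,
                     osc_per_instr_cycle=4):
    """Return (loop iterations, leftover cycles) for a busy-wait delay.

    Integer arithmetic is used throughout so the count is exact.
    """
    # Instruction cycles that fit in the requested delay.
    cycles_needed = delay_us * fosc_hz // (osc_per_instr_cycle * 1_000_000)
    # Whole loop iterations, plus cycles left over to pad with nops.
    return divmod(cycles_needed, cycles_per_iteration)

# Hypothetical case: a 100 microsecond delay with a 4 MHz oscillator.
iterations, leftover = delay_iterations(100, 4_000_000)
print(iterations, leftover)
```

The leftover cycles are what a programmer would typically fill with individual nops to hit the delay exactly. An interrupt firing in the middle of the loop would add clocks that this count knows nothing about.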

old_timer answered Oct 04 '22 06:10