My guess is that the __no_operation() intrinsic (ARM) instruction should take 1/(168 MHz) to execute, provided that each NOP executes in one clock cycle, which I would like to verify via documentation.
Is there a standard location for information regarding the instruction cycle execution time for a processor? I am trying to determine how long an STM32F407IGH6 processor should take to execute a NOP instruction when running at 168 MHz.
Some processors require multiple oscillations per instruction cycle, some are 1-to-1 in comparing clock-cycles to instruction-cycles.
The term "instruction cycle" is not even present in the entirety of the datasheet provided by STMicro, nor in their programming manual (listing the processor's instruction set, btw). The 8051 documentation, however, clearly defines its instruction cycle execution times, in addition to its machine cycle characteristics.
With every tick of the clock, the CPU fetches and executes one instruction. The clock speed is measured in cycles per second, and one cycle per second is known as 1 hertz. This means that a CPU with a clock speed of 2 gigahertz (GHz) can carry out two thousand million (or two billion) cycles per second.
The basic operation of a computer is called the 'fetch-execute' cycle. The CPU is designed to understand a set of instructions - the instruction set. It fetches the instructions from the main memory and executes them. This is done repeatedly from when the computer is booted up to when it is shut down.
So the higher the frequency, the faster the instruction execution speed of the processor (CPU). In each clock cycle, the CPU completes one part of the execution process. That part can be a fetch, decode, execute or store operation.
ALL instructions require more than one clock cycle to execute: fetch, decode, execute. If you are running on an STM32 you are likely taking several clocks per fetch just due to the slowness of the flash; if running from RAM, who knows whether it runs at 168 MHz or slower. The ARM buses generally take a number of clock cycles to do anything.
Nobody talks about instruction cycles anymore because they are not deterministic. The answer is always "it depends".
It may take X hours to build a single car, but if you start building a car then 30 seconds later start building another and every 30 seconds start another then after X hours you will have a new car every 30 seconds. Does that mean it takes 30 seconds to make a car? Of course not. But it does mean that once up and running you can average a new car every 30 seconds on that production line.
That is exactly how processors work. Each instruction takes a number of clocks to run, but you pipeline them so that many are in the pipe at once, and the core, if fed the right instructions at one per clock, can complete those instructions at one per clock. With branching and slow memory/ROM, you can't even expect to get that.
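To put numbers on the car analogy, here is a minimal sketch of an idealized pipeline with no stalls and no branches. The function names are made up for illustration, and the model deliberately ignores flash wait states and bus delays, so real hardware will do worse.

```c
#include <assert.h>
#include <stdint.h>

/* Idealized pipeline: one instruction enters per clock, nothing stalls.
 * The first instruction needs `stages` clocks to leave the pipe; each
 * later instruction completes one clock after the one before it. */
uint64_t pipeline_total_clocks(uint32_t stages, uint64_t instructions)
{
    assert(stages > 0);
    if (instructions == 0) {
        return 0;
    }
    return (uint64_t)stages + (instructions - 1);
}

/* Average clocks per instruction over the whole run. */
double pipeline_average_cpi(uint32_t stages, uint64_t instructions)
{
    assert(instructions > 0);
    return (double)pipeline_total_clocks(stages, instructions)
         / (double)instructions;
}
```

With a three-stage pipeline (the depth of the Cortex-M4), a single instruction still costs 3 clocks, while 1000 back-to-back instructions cost 1002 clocks, an average of about 1.002 clocks each. That is the "new car every 30 seconds" figure, not the time to build one car.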
If you want to do an experiment on your processor, then make a loop with a few hundred nops:
beg = read timer
load r0 = 100000
top:
nop
nop
nop
nop
nop
nop
...
nop
nop
nop
r0 = r0 - 1
bne top
end = read timer
If it takes fractions of a second to complete that loop then either make the number of nops larger or have it run an order of magnitude more loops. Actually you want to hit a significant number of timer ticks, not necessarily seconds or minutes on a wall clock but something in terms of a good sized number of timer ticks.
Then do the math and compute the average.
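The math could look something like the sketch below. It assumes the timer is a free-running 32-bit up-counter and that you know how many CPU clocks pass per timer tick; the function and parameter names are hypothetical, not from any vendor library.

```c
#include <assert.h>
#include <stdint.h>

/* Elapsed ticks of a free-running 32-bit up-counter.
 * Unsigned subtraction gives the right answer across one wraparound. */
uint32_t elapsed_ticks(uint32_t beg, uint32_t end)
{
    return end - beg;
}

/* Average CPU clocks per nop for the whole test loop.
 * cpu_clocks_per_tick is 1.0 if the timer counts at the CPU clock,
 * larger if the timer runs from a divided clock (an assumption you
 * must check against your own clock setup). */
double average_clocks_per_nop(uint32_t ticks, double cpu_clocks_per_tick,
                              uint32_t loops, uint32_t nops_per_loop)
{
    assert(loops > 0 && nops_per_loop > 0);
    double total_clocks = (double)ticks * cpu_clocks_per_tick;
    double total_nops = (double)loops * (double)nops_per_loop;
    return total_clocks / total_nops;
}
```

Note that the subtract and branch at the bottom of the loop are counted in the average too. The more nops you put in the loop body, the less that overhead skews the number.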
Repeat the experiment with the program sitting in RAM instead of ROM.
Slow the processor clock down to the fastest speed that does not require flash wait states, and repeat running from flash.
Being a Cortex-M4 part, turn the instruction cache on, repeat using flash, then repeat using RAM (at 168 MHz).
If you didn't get a range of different results from all of these experiments using the same test loop, you are probably doing something wrong.
If you carefully configure all your clocks in the Reset and Clock Control (RCC) peripheral and you know all the clock frequencies, you can calculate the exact execution time for most instructions and get at least a worst-case estimate for all of them. For example, I'm using an STM32F439ZI processor, which is a Cortex-M4 compatible with the STM32F407. If you look at the reference manual, the clock tree shows you the PLL and all bus prescalers. In my case I have an 8 MHz external quartz crystal with the PLL configured to provide an 84 MHz system clock (SYSCLK). That means that one processor cycle is 1.0/84e6 ~ 12 ns.
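As a rough sketch of that calculation: on the STM32F4 the main PLL output is the input clock divided by PLLM, multiplied by PLLN, then divided by PLLP. The divider values used in the example afterwards (M = 8, N = 336, P = 4) are my assumption of one setting that turns 8 MHz into 84 MHz; the actual register values on any given board may differ.

```c
#include <assert.h>
#include <stdint.h>

/* SYSCLK when sourced from the main PLL:
 *   f_VCO = f_in * PLLN / PLLM
 *   f_PLL = f_VCO / PLLP
 * Divider values passed in are assumptions supplied by the caller. */
uint32_t pll_sysclk_hz(uint32_t hse_hz, uint32_t pllm,
                       uint32_t plln, uint32_t pllp)
{
    assert(pllm > 0 && pllp > 0);
    uint64_t vco_hz = (uint64_t)hse_hz * plln / pllm;
    return (uint32_t)(vco_hz / pllp);
}

/* Duration of one SYSCLK cycle in nanoseconds. */
double cycle_time_ns(uint32_t sysclk_hz)
{
    assert(sysclk_hz > 0);
    return 1e9 / (double)sysclk_hz;
}
```

Calling pll_sysclk_hz(8000000, 8, 336, 4) gives 84000000, and cycle_time_ns of that is about 11.9 ns. Changing PLLP to 2 gives the 168 MHz from the question, with a cycle time of about 5.95 ns.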
For a reference on how many SYSCLK cycles an instruction takes, use the ARM® Cortex®‑M4 Processor Technical Reference Manual. For example, the MOV instruction takes one cycle in most cases. The ADD instruction also takes one cycle in most cases, which means that after 12 ns the result of the addition is stored in the register and ready for use by another operation.
You can use that information to schedule your processor resources in many cases, for instance with periodic interrupts. Electrical engineers and low-level embedded software developers do talk about this and do exactly this when it comes to strict real-time and safety-critical systems. Normally engineers work with the worst-case execution time during design, ignoring the pipeline, to get a quick and rough insight into the processor load. During implementation you use tools for precise timing analysis and refine the software.
In the process of design and implementation, the non-deterministic effects are reduced until they are negligible.
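A rough load estimate of the kind described above might look like this. It is only a sketch: the function names are hypothetical, and the worst-case cycle count is something you have to supply yourself from the instruction timings, with the pipeline ignored.

```c
#include <assert.h>
#include <stdint.h>

/* Worst-case duration of a piece of code in nanoseconds, given its
 * worst-case cycle count and the system clock frequency. */
double worst_case_time_ns(uint32_t worst_case_cycles, uint32_t sysclk_hz)
{
    assert(sysclk_hz > 0);
    return (double)worst_case_cycles * 1e9 / (double)sysclk_hz;
}

/* Fraction of CPU time (0.0 to 1.0) consumed by a periodic interrupt
 * whose handler needs at most worst_case_cycles per invocation. */
double isr_cpu_load(uint32_t worst_case_cycles, uint32_t interrupt_rate_hz,
                    uint32_t sysclk_hz)
{
    assert(sysclk_hz > 0);
    return (double)worst_case_cycles * (double)interrupt_rate_hz
         / (double)sysclk_hz;
}
```

For example, a handler budgeted at 840 cycles and firing at 10 kHz on an 84 MHz SYSCLK takes at most 10 microseconds per call and loads the CPU by about 10%. Those three input numbers are illustrative, not measurements.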