Exact copy of machine code runs 50% slower than the original function

Question

I've been experimenting a bit with execution from RAM and flash memory on embedded systems. For rapid prototyping and testing I'm currently using an Arduino Due (SAM3X8E ARM Cortex-M3). As far as I can see, the Arduino runtime and bootloader should make no difference here.

Here is the issue: I have a function (calc) written in ARM Thumb assembly. calc calculates a number and returns it (runtime is over 1 s for the given input). I then manually extracted the assembled machine code of that function and put it as raw bytes into a second function (calc2). Both functions are confirmed to reside in flash memory (addresses 0x80149 and 0x8017D, right next to each other; bit 0 is the Thumb bit, so the code itself starts at 0x80148 and 0x8017C). This has been confirmed both via disassembly and a runtime check.

void setup() {
  Serial.begin(115200);
  timeFnc(calc);
  timeFnc(calc2);
}

void timeFnc(int (*functionPtr)(void)) {
  unsigned long time1 = micros();

  int res = (*functionPtr)();

  unsigned long time2 = micros();
  Serial.print("Address: ");
  Serial.print((unsigned int)functionPtr);
  Serial.print(" Res: ");
  Serial.print(res);
  Serial.print(": ");
  Serial.print(time2-time1);
  Serial.println("us");

}

int calc() {
   asm volatile(
      "movs r1, #33 \n\t"
      "push {r1,r4,r5,lr} \n\t"
      "bl .in \n\t"
      "pop {r1,r4,r5,lr} \n\t"
      "bx lr \n\t"

      ".in: \n\t"
      "movs r5,#1 \n\t"
      "subs r1, r1, #1 \n\t"
      "cmp r1, #2 \n\t"
      "blo .lblb \n\t"
      "movs r5,#1 \n\t"

      ".lbla: \n\t"
      "push {r1, r5, lr} \n\t"
      "bl .in \n\t"
      "pop {r1, r5, lr} \n\t"
      "adds r5,r0 \n\t"
      "subs r1,#2 \n\t"
      "cmp r1,#1 \n\t"
      "bhi .lbla \n\t"
      ".lblb: \n\t"
      "movs r0,r5 \n\t"
      "bx lr \n\t"
      ::
   ); //redundant auto generated bx lr, aware of that
}

int calc2() {
  asm volatile(
    ".word  0xB5322121 \n\t"
    ".word  0xF803F000 \n\t"
    ".word  0x4032E8BD \n\t"
    ".word  0x25014770 \n\t"

    ".word  0x29023901 \n\t"
    ".word  0x800BF0C0 \n\t"
    ".word  0xB5222501 \n\t"
    ".word  0xFFF7F7FF \n\t"
    ".word  0x4022E8BD \n\t"
    ".word  0x3902182D \n\t"
    ".word  0xF63F2901 \n\t"
    ".word  0x0028AFF6 \n\t"
    ".word  0x47704770 \n\t"
  );
}

void loop() {

}

Output of the above program on the Arduino Due target is:

Address: 524617 Res: 3524578: 1338254us
Address: 524669 Res: 3524578: 2058819us

This confirms that the results are equal and that the runtime addresses are as expected. The manually entered machine-code function runs about 50% slower (2058819 us vs. 1338254 us).

Disassembly with arm-none-eabi-objdump further confirms the respective addresses, flash memory residency and equality of the machine code (Note endianness and byte grouping!):

00080148 <_Z4calcv>:
   80148:   2121        movs    r1, #33 ; 0x21
   8014a:   b532        push    {r1, r4, r5, lr}
   8014c:   f000 f803   bl  80156 <.in>
   80150:   e8bd 4032   ldmia.w sp!, {r1, r4, r5, lr}
   80154:   4770        bx  lr

00080156 <.in>:
   80156:   2501        movs    r5, #1
   80158:   3901        subs    r1, #1
   8015a:   2902        cmp r1, #2
   8015c:   f0c0 800b   bcc.w   80176 <.lblb>
   80160:   2501        movs    r5, #1

00080162 <.lbla>:
   80162:   b522        push    {r1, r5, lr}
   80164:   f7ff fff7   bl  80156 <.in>
   80168:   e8bd 4022   ldmia.w sp!, {r1, r5, lr}
   8016c:   182d        adds    r5, r5, r0
   8016e:   3902        subs    r1, #2
   80170:   2901        cmp r1, #1
   80172:   f63f aff6   bhi.w   80162 <.lbla>

00080176 <.lblb>:
   80176:   0028        movs    r0, r5
   80178:   4770        bx  lr
}
   8017a:   4770        bx  lr

0008017c <_Z5calc2v>:
   8017c:   b5322121    .word   0xb5322121
   80180:   f803f000    .word   0xf803f000
   80184:   4032e8bd    .word   0x4032e8bd
   80188:   25014770    .word   0x25014770
   8018c:   29023901    .word   0x29023901
   80190:   800bf0c0    .word   0x800bf0c0
   80194:   b5222501    .word   0xb5222501
   80198:   fff7f7ff    .word   0xfff7f7ff
   8019c:   4022e8bd    .word   0x4022e8bd
   801a0:   3902182d    .word   0x3902182d
   801a4:   f63f2901    .word   0xf63f2901
   801a8:   0028aff6    .word   0x0028aff6
   801ac:   47704770    .word   0x47704770
}
   801b0:   4770        bx  lr
    ...
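
The note about endianness and byte grouping can be checked mechanically. The following is a minimal sketch, not part of the original test program: the helper names splitWord and checkAgainstListing are hypothetical. It splits a little-endian .word into the two Thumb halfwords in address order and compares the first two words of calc2 with the first instructions in the calc listing.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Split one little-endian 32-bit .word into two 16-bit Thumb halfwords,
// in address order. The low 16 bits sit at the lower address, so they
// come first. (splitWord is an illustrative helper name.)
inline std::array<std::uint16_t, 2> splitWord(std::uint32_t word) {
    return {{ static_cast<std::uint16_t>(word & 0xFFFFu),
              static_cast<std::uint16_t>(word >> 16) }};
}

// Compare the first two .word values of calc2 with the halfwords that
// objdump prints for calc at 0x80148..0x8014e.
inline void checkAgainstListing() {
    assert(splitWord(0xB5322121u)[0] == 0x2121u);  // movs r1, #33
    assert(splitWord(0xB5322121u)[1] == 0xB532u);  // push {r1, r4, r5, lr}
    assert(splitWord(0xF803F000u)[0] == 0xF000u);  // bl .in, first halfword
    assert(splitWord(0xF803F000u)[1] == 0xF803u);  // bl .in, second halfword
}
```

The same split applies to the remaining words, which is why 0x4032E8BD in calc2 corresponds to "e8bd 4032" in the calc listing.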

We can further confirm the analogously used calling convention:

00080234 <setup>:
void setup() {
   80234:   b508        push    {r3, lr}
  Serial.begin(115200);
   80236:   4806        ldr r0, [pc, #24]   ; (80250 <setup+0x1c>)
   80238:   f44f 31e1   mov.w   r1, #115200 ; 0x1c200
   8023c:   f000 fcb4   bl  80ba8 <_ZN9UARTClass5beginEm>
  timeFnc(calc);
   80240:   4804        ldr r0, [pc, #16]   ; (80254 <setup+0x20>)
   80242:   f7ff ffb7   bl  801b4 <_Z7timeFncPFivE>
}
   80246:   e8bd 4008   ldmia.w sp!, {r3, lr}
  timeFnc(calc2);
   8024a:   4803        ldr r0, [pc, #12]   ; (80258 <setup+0x24>)
   8024c:   f7ff bfb2   b.w 801b4 <_Z7timeFncPFivE>
   80250:   200705cc    .word   0x200705cc
   80254:   00080149    .word   0x00080149
   80258:   0008017d    .word   0x0008017d

I can rule out this being due to some kind of speculative fetch (which the Cortex-M3 seemingly has!) or interrupts. (EDIT: NOPE, I can't. Probably some kind of prefetch) Changing the order of execution or adding function calls in between does not change the result. What could be the culprit here?

EDIT: After changing the alignment of the machine code function (insert nops as a prologue) I get the following results:

+16bit for calc2:

Address: 524617 Res: 3524578: 1102257us
Address: 524669 Res: 3524578: 1846968us

+32bit for calc2:

Address: 524617 Res: 3524578: 1102257us
Address: 524669 Res: 3524578: 1535424us

+48bit for calc2:

Address: 524617 Res: 3524578: 1102155us
Address: 524669 Res: 3524578: 1413180us

+64bit for calc2:

Address: 524617 Res: 3524578: 1102155us
Address: 524669 Res: 3524578: 1346606us

+80bit for calc2:

Address: 524617 Res: 3524578: 1102145us
Address: 524669 Res: 3524578: 1180105us

EDIT2: Only running calc:

Address: 524617 Res: 3524578: 1102155us

Only running calc2:

Address: 524617 Res: 3524578: 1102257us

Changing the order:

Address: 524669 Res: 3524578: 1554160us
Address: 524617 Res: 3524578: 1102211us

EDIT3: Adding .p2align 4 before label .in for calc only, separate execution:

Address: 524625 Res: 3524578: 1413185us

Both as in original benchmark:

Address: 524625 Res: 3524578: 1413185us
Address: 524689 Res: 3524578: 1535424us

EDIT4: Reversing the position in flash completely changes the outcome. -> Linear prefetch?

A.K. · Accepted Answer

The speed of code execution from flash depends on the number of wait states and on the alignment of each branch target. In this and similar processors, like the STM32F103, flash needs 3 wait states when the core runs at the highest frequency. This means each taken branch may take between 2 and 5 cycles, and in a branch-heavy function like this one that difference can dominate the total run time.

To compensate for the slow flash, these processors have a wide flash bus and a fetch buffer. The SAM3X has a pair of 128-bit instruction buffers, which appear to be filled in a prefetch pattern [1].
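
To see how the branch targets of the two functions fall relative to 128-bit (16-byte) flash lines, the offsets can be computed from the addresses in the question's disassembly. This is only a sketch: the helper names are hypothetical, and the calc2 addresses are derived by adding 0x34 (the distance between 0x80148 and 0x8017C) to the calc labels. It shows that the layouts differ, not that this alone explains the measured times.

```cpp
#include <cassert>
#include <cstdint>

// Byte offset of an address inside its flash line.
// lineBytes must be a power of two; 16 bytes = one 128-bit line.
inline std::uint32_t offsetInLine(std::uint32_t addr,
                                  std::uint32_t lineBytes = 16) {
    assert(lineBytes != 0 && (lineBytes & (lineBytes - 1)) == 0);
    return addr & (lineBytes - 1);
}

// Bytes of code available from addr up to the end of its line, i.e. how
// much can be executed after a taken branch before the next line is needed.
inline std::uint32_t bytesLeftInLine(std::uint32_t addr,
                                     std::uint32_t lineBytes = 16) {
    return lineBytes - offsetInLine(addr, lineBytes);
}

// Branch targets from the listing (calc) and the derived ones (calc2).
constexpr std::uint32_t kCalcIn    = 0x80156;         // .in
constexpr std::uint32_t kCalcLblA  = 0x80162;         // .lbla
constexpr std::uint32_t kCalcLblB  = 0x80176;         // .lblb
constexpr std::uint32_t kCalc2Skew = 0x8017C - 0x80148;  // 0x34
```

With these values, .in, .lbla and .lblb sit at offsets 6, 2 and 6 in calc, but at 10, 6 and 10 in calc2, so .lbla has 14 bytes left in its line in calc and only 10 in calc2.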

To optimize a tight loop, try to fit it into a 32-byte code block and align it on a 16-byte boundary (or better, 32 bytes, just in case). It is also a good idea to check that the flash parameters are set up correctly, i.e. that prefetch is enabled and, on this MCU, that the bus width is set to 128 bits. Copying code to RAM may be an option, but it's a pain and can actually slow things down compared to properly working fetch buffers.
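
One way to request such alignment from the compiler is the GCC/Clang aligned function attribute. The sketch below assumes a GCC-compatible toolchain; hotLoop and isAligned are hypothetical names used only for illustration. Note that the attribute aligns the function entry only. Branch targets inside an inline-assembly body, such as .in and .lbla in the question, still need an assembler directive like .p2align placed before the label.

```cpp
#include <cassert>
#include <cstdint>

// Ask the compiler to place the first instruction of this function on a
// 32-byte boundary (GCC/Clang extension). The body is a placeholder loop.
__attribute__((aligned(32))) int hotLoop(int n) {
    int sum = 0;
    for (int i = 0; i < n; ++i) {
        sum += i;
    }
    return sum;
}

// Runtime check that a function entry is aligned to `boundary` bytes.
// Bit 0 is cleared first because Thumb function pointers have it set.
inline bool isAligned(int (*fn)(int), std::uintptr_t boundary) {
    assert(boundary != 0);
    const std::uintptr_t addr =
        reinterpret_cast<std::uintptr_t>(fn) & ~std::uintptr_t{1};
    return addr % boundary == 0;
}
```

Printing isAligned(hotLoop, 32) next to the timing output makes it easy to confirm that a layout change actually took effect before comparing run times.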

[1] http://ww1.microchip.com/downloads/en/DeviceDoc/Atmel-11057-32-bit-Cortex-M3-Microcontroller-SAM3X-SAM3A_Datasheet.pdf, page 294, Figures 18-2, 18-3.

Tags: performance, assembly, arm, arduino, cortex-m

fscheidl

