This code (arm):
void blinkRed(void)
{
    for(;;)
    {
        bb[0x0008646B] ^= 1;
        sys.Delay_ms(14);
    }
}
...is compiled to the following assembly:
08000470: ldr r4, [pc, #20] ; (0x8000488 <blinkRed()+24>) // r4 = 0x422191ac
08000472: ldr r6, [pc, #24] ; (0x800048c <blinkRed()+28>)
08000474: movs r5, #14
08000476: ldr r3, [r4, #0]
08000478: eor.w r3, r3, #1
0800047c: str r3, [r4, #0]
0800047e: mov r0, r6
08000480: mov r1, r5
08000482: bl 0x80001ac <CSTM32F100C6::Delay_ms(unsigned int)>
08000486: b.n 0x8000476 <blinkRed()+6>
That is fine.
But if I just change the array index (by -0x400)...
void blinkRed(void)
{
    for(;;)
    {
        bb[0x0008606B] ^= 1;
        sys.Delay_ms(14);
    }
}
...I get less optimized code:
08000470: ldr r4, [pc, #24] ; (0x800048c <blinkRed()+28>) // r4 = 0x42218000
08000472: ldr r6, [pc, #28] ; (0x8000490 <blinkRed()+32>)
08000474: movs r5, #14
08000476: ldr.w r3, [r4, #428] ; 0x1ac
0800047a: eor.w r3, r3, #1
0800047e: str.w r3, [r4, #428] ; 0x1ac
08000482: mov r0, r6
08000484: mov r1, r5
08000486: bl 0x80001ac <CSTM32F100C6::Delay_ms(unsigned int)>
0800048a: b.n 0x8000476 <blinkRed()+6>
The difference is that in the first case r4 is loaded with the target address directly (0x422191ac), and memory is then accessed with 2-byte instructions. In the second case r4 is loaded with an intermediate address (0x42218000), and memory is then accessed with 4-byte instructions that add an offset (+0x1ac) to reach the target address (0x422181ac).
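As a sanity check on the addresses quoted above, here is a minimal sketch that recomputes them at compile time. It assumes bb ends up exactly at the ORIGIN of the BITBAND region (0x42000000), which is what the disassembly suggests; the names kBitBandOrigin and bbWordAddress are made up for this illustration and are not part of the project.

```cpp
#include <cstdint>

// Assumption: bb[] starts at the ORIGIN of the BITBAND memory region.
constexpr std::uint32_t kBitBandOrigin = 0x42000000u;

// Address of bb[index]; each element is a 32-bit word (u32).
constexpr std::uint32_t bbWordAddress(std::uint32_t index)
{
    return kBitBandOrigin + 4u * index;
}

// First case: r4 holds the final address.
static_assert(bbWordAddress(0x0008646Bu) == 0x422191ACu,
              "first index maps to 0x422191ac");

// Second case: r4 holds 0x42218000 and the ldr/str add 0x1ac.
static_assert(bbWordAddress(0x0008606Bu) == 0x422181ACu,
              "second index maps to 0x422181ac");
static_assert(0x42218000u + 0x1ACu == bbWordAddress(0x0008606Bu),
              "base plus offset reaches the same address");
```

Both indices therefore differ only in the address bits above the 0x1ac offset, which is why the generated loops are otherwise identical.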
Why does the compiler do this?
I use:
arm-none-eabi-g++ -mcpu=cortex-m3 -mthumb -g2 -Wall -O1 -std=gnu++14 -fno-exceptions -fno-use-cxa-atexit -fstrict-volatile-bitfields -c -DSTM32F100C6T6B -DSTM32F10X_LD_VL
bb is declared as:
__attribute__ ((section(".bitband"))) volatile u32 bb[0x00800000];
In the .ld linker script it is set up as follows.
In the MEMORY section:
BITBAND(rwx): ORIGIN = 0x42000000, LENGTH = 0x02000000
In the SECTIONS section:
.bitband (NOLOAD) :
SUBALIGN(0x02000000)
{
KEEP(*(.bitband))
} > BITBAND
I would consider this an artefact of -O1, i.e. a missed optimization opportunity.
It can be understood in more detail by looking at the code generated with -O0 to load bb[...]:
First case:
movw r2, #:lower16:bb
movt r2, #:upper16:bb
movw r3, #37292
movt r3, 33
adds r3, r2, r3
ldr r3, [r3, #0]
Second case:
movw r3, #:lower16:bb
movt r3, #:upper16:bb
add r3, r3, #2195456 ; 0x218000 = 4*0x86000
add r3, r3, #428
ldr r3, [r3, #0]
The code in the second case is better, and it can be done this way because the constant offset 0x2181ac can be added with two add instructions (0x218000 + 0x1ac), which is not possible when the index is 0x0008646B.
-O1 only performs optimizations that are not time-consuming. Apparently it merges the add and the ldr early, and so later misses the opportunity to load the whole address with a single pc-relative ldr.
Compile with -O2 (or -fgcse) and the code looks as expected.
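The remark about the two add instructions can be checked with a small compile-time sketch. It tests whether a byte offset is the sum of two Thumb-2 "modified immediate" constants (an 8-bit value shifted left, or one of the repeated-byte patterns). The function names are hypothetical, and this is only a rough model: GCC's real constant-synthesis logic also considers sub, mvn and 12-bit addw forms, so treat the result as an illustration rather than a description of the compiler.

```cpp
#include <cstdint>

// True if v fits the Thumb-2 modified-immediate encoding:
// 0x000000XY, 0x00XY00XY, 0xXY00XY00, 0xXYXYXYXY,
// or an 8-bit value shifted left within 32 bits.
constexpr bool isModifiedImm(std::uint32_t v)
{
    const std::uint32_t lo = v & 0xFFu;
    const std::uint32_t hi = (v >> 8) & 0xFFu;
    if (v == lo || v == lo * 0x00010001u || v == lo * 0x01010101u ||
        v == hi * 0x01000100u)
        return true;
    // v != 0 here. Dividing by the lowest set bit strips trailing zeros;
    // what remains must fit in 8 bits.
    const std::uint32_t lowestBit = v & (~v + 1u);
    return v / lowestBit <= 0xFFu;
}

// True if c == a + b (mod 2^32) for some modified immediates a and b.
constexpr bool isSumOfTwoModifiedImms(std::uint32_t c)
{
    for (std::uint32_t m = 0; m <= 0xFFu; ++m) {
        // Repeated-byte patterns as the first addend.
        if (isModifiedImm(c - m * 0x00010001u) ||
            isModifiedImm(c - m * 0x01000100u) ||
            isModifiedImm(c - m * 0x01010101u))
            return true;
        // Shifted 8-bit values as the first addend.
        for (unsigned s = 0; s <= 24; ++s)
            if (isModifiedImm(c - (m << s)))
                return true;
    }
    return false;
}

// Second case: 4 * 0x8606B = 0x2181AC = 0x218000 + 0x1AC.
static_assert(isModifiedImm(0x218000u) && isModifiedImm(0x1ACu),
              "both parts are encodable");
static_assert(isSumOfTwoModifiedImms(4u * 0x0008606Bu),
              "offset can be added with two adds");

// First case: 4 * 0x8646B = 0x2191AC has no such split in this model.
static_assert(!isSumOfTwoModifiedImms(4u * 0x0008646Bu),
              "offset cannot be added with two adds");
```

Under this model the second offset splits cleanly into 0x218000 + 0x1ac, matching the two add instructions shown above, while the first offset does not, which is consistent with the movw/movt pair being needed there.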