When trying to address individual bytes inside a uint64_t, AVR gcc⁽¹⁾ gives me a strange prologue/epilogue, while the same function written using uint32_t gives me a single ret (the example function is a NOP).
Why does gcc do this? How do I remove it?
You can see the code here, in Compiler Explorer.
⁽¹⁾ gcc 5.4.0 from the Arduino 1.8.9 distribution, parameters: -O3 -std=c++11.
Source code:
#include <stdint.h>

uint32_t f_u32(uint32_t x) {
    union y {
        uint8_t p[4];
        uint32_t w;
    };
    return y{ .p = {
        y{ .w = x }.p[0],
        y{ .w = x }.p[1],
        y{ .w = x }.p[2],
        y{ .w = x }.p[3]
    } }.w;
}

uint64_t f_u64(uint64_t x) {
    union y {
        uint8_t p[8];
        uint64_t w;
    };
    return y{ .p = {
        y{ .w = x }.p[0],
        y{ .w = x }.p[1],
        y{ .w = x }.p[2],
        y{ .w = x }.p[3],
        y{ .w = x }.p[4],
        y{ .w = x }.p[5],
        y{ .w = x }.p[6],
        y{ .w = x }.p[7]
    } }.w;
}
Generated assembly for the uint32_t version:
f_u32(unsigned long):
ret
Generated assembly for the uint64_t version:
f_u64(unsigned long long):
push r28
push r29
in r28,__SP_L__
in r29,__SP_H__
subi r28,72
sbc r29,__zero_reg__
in __tmp_reg__,__SREG__
cli
out __SP_H__,r29
out __SREG__,__tmp_reg__
out __SP_L__,r28
subi r28,-72
sbci r29,-1
in __tmp_reg__,__SREG__
cli
out __SP_H__,r29
out __SREG__,__tmp_reg__
out __SP_L__,r28
pop r29
pop r28
ret
I am not sure if this is a good answer, but it is the best I can give. The assembly for the f_u64() function allocates 72 bytes on the stack and then deallocates them again. Since this involves registers r28 and r29, they are saved at the beginning and restored at the end.
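The deallocation in the optimized listing is written as subi r28,-72 followed by sbci r29,-1, which can look odd at first. AVR has no add-immediate instruction for 8-bit registers, so the compiler adds 72 to the r29:r28 pair by subtracting the 16-bit constant -72 with borrow. The following is a minimal sketch that simulates this in C++; the function name add72_via_sub is made up for illustration and does not appear in the original code.

```cpp
#include <stdint.h>

// Hypothetical helper: simulates "subi r28,-72 ; sbci r29,-1" on a 16-bit
// value held in two 8-bit halves (lo stands for r28, hi stands for r29).
uint16_t add72_via_sub(uint16_t y) {
    uint8_t lo = (uint8_t)(y & 0xFF);
    uint8_t hi = (uint8_t)(y >> 8);

    const uint8_t k_lo = (uint8_t)(-72);  // 0xB8, low byte of -72
    const uint8_t k_hi = (uint8_t)(-1);   // 0xFF, high byte of -72

    // subi: subtract the immediate, remember whether a borrow occurred.
    bool borrow = lo < k_lo;
    lo = (uint8_t)(lo - k_lo);

    // sbci: subtract the immediate and the borrow from the high byte.
    hi = (uint8_t)(hi - k_hi - (borrow ? 1 : 0));

    return (uint16_t)(((uint16_t)hi << 8) | lo);
}
```

Subtracting -72 modulo 65536 is the same as adding 72, so the pair ends up back at the value it had before the allocation.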
If you compile without optimization (I also dropped the -std=c++11 flag; I do not think it makes any difference), you will see that the f_u64() function starts by allocating 80 bytes on the stack. This is similar to the opening instructions in the optimized code, just with 80 bytes instead of 72:
in r28,__SP_L__
in r29,__SP_H__
subi r28,80
sbc r29,__zero_reg__
in __tmp_reg__,__SREG__
cli
out __SP_H__,r29
out __SREG__,__tmp_reg__
out __SP_L__,r28
These 80 bytes are all actually used. First the value of the argument x is stored (8 bytes), and then a lot of data is moved around in the remaining 72 bytes.
After that, the 80 bytes are deallocated from the stack, similar to the closing instructions in the optimized code:
subi r28,-80
sbci r29,-1
in __tmp_reg__,__SREG__
cli
out __SP_H__,r29
out __SREG__,__tmp_reg__
out __SP_L__,r28
My guess is that the optimizer concludes that the 8 bytes for storing the argument are not needed, so only 72 bytes remain. It then concludes that all the moving around of data is not needed either. However, it fails to figure out that this makes the 72 bytes of stack space unnecessary as well.
Hence my best bet is that this is a limitation or a bug in the optimizer (whichever you prefer to call it). In that case the only "solution" is to shuffle the real code around until you find a workaround, or to report it as a bug against the compiler.
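One way to "shuffle the code around" is to avoid the union temporaries and address the bytes with shifts instead. The following is only a sketch of that idea: the name f_u64_shift is made up, and I have not checked whether avr-gcc 5.4.0 produces a frame-free function for it. Note that 64-bit shifts on AVR may be compiled into library calls, so this could trade one inefficiency for another.

```cpp
#include <stdint.h>

// Hypothetical variant of f_u64 that extracts and reassembles the bytes
// with shifts instead of union temporaries. Like the original, it returns
// its argument unchanged.
uint64_t f_u64_shift(uint64_t x) {
    uint64_t result = 0;
    for (uint8_t i = 0; i < 8; ++i) {
        // Byte i of x, counting from the least significant byte.
        uint8_t byte = (uint8_t)(x >> (8 * i));
        // Put it back at the same position.
        result |= (uint64_t)byte << (8 * i);
    }
    return result;
}
```

As a side benefit, this version does not rely on reading a union member other than the one last written, which standard C++ does not define (GCC documents it as supported).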
You asked how to remove the inefficient code. You can simply get rid of the function, since it performs no calculation and just returns the value that was passed to it.
If you still want to be able to call that function from other code for some reason, I would do:
#define f_u64(x) ((uint64_t)(x))
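If you would rather not use a macro, an inline function is a possible alternative. This is a sketch under the assumption that call sites only need something callable as f_u64(x); unlike the macro, it has a real function type and a fixed parameter type.

```cpp
#include <stdint.h>

// Possible replacement for the macro: an identity function that the
// compiler can inline at each call site.
static inline uint64_t f_u64(uint64_t x) {
    return x;
}
```

Whether the call is actually inlined depends on the optimization level, so it is worth checking the generated assembly for your build.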