For the following function...
#include <stdint.h>

uint16_t swap(const uint16_t value)
{
return value << 8 | value >> 8;
}
...why does ARM gcc 6.3.0 with -O2 yield the following assembly?
swap(unsigned short):
lsr r3, r0, #8
orr r0, r3, r0, lsl #8
lsl r0, r0, #16 @ shift left
lsr r0, r0, #16 @ shift right
bx lr
It appears the compiler is using two shifts to mask off the unwanted bytes, instead of using a logical AND. Could the compiler instead use `and r0, r0, #4294901760`?
Older ARM assembly cannot create constants easily. Instead, they are placed into literal pools and read in via a memory load. The `and` you suggest can, I believe, only take an 8-bit literal with a shift. Your 0xFFFF0000 requires 16 significant bits and so cannot be encoded in one instruction.
So, we can:

- load the constant from memory and do an `and` (slow),
- take 2 instructions to create the constant and 1 to `and` it (longer; a C sketch of this masking approach follows below),
- or just shift twice cheaply and call it good.
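For reference, here is a hedged C sketch of that masking idea (`swap_and` is a hypothetical name; the comment notes why such constants are awkward on classic ARM):

#include <stdint.h>

/* Hypothetical masking version: zero the unwanted bits with an
   explicit AND instead of the shift pair. On classic ARM, neither
   0x0000FFFF nor 0xFFFF0000 fits the 8-bit rotated-immediate
   encoding, so building the mask costs a literal-pool load or two
   instructions (e.g. mov r3, #0xFF then orr r3, r3, #0xFF00)
   before the and. */
uint16_t swap_and(uint16_t value)
{
    uint32_t x = ((uint32_t)value << 8) | ((uint32_t)value >> 8);
    return (uint16_t)(x & 0xFFFFu); /* explicit mask, same result */
}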
The compiler chose the shifts, and honestly, it is plenty fast.
Now for a reality check:
Worrying about a single shift, unless this is a 100%-for-sure bottleneck, is a waste of time. Even if the compiler were sub-optimal, you would almost never feel it. For micro-ops like this, worry about "hot" loops in your code instead. Looking at this out of curiosity is awesome. Worrying about this exact code for performance in your app, not so much.
Edit:
It has been noted by others here that newer versions of the ARM specification allow this sort of thing to be done more efficiently. This shows that it is important, when talking at this level, to specify the chip or at least the exact ARM spec we are dealing with. I was assuming an ancient ARM, given the lack of newer instructions in your output. If we are tracking compiler bugs, then this assumption may not hold, and knowing the specification is even more important. For a swap like this, there are indeed simpler instructions to handle it in later versions.
Edit 2
One thing that could be done to possibly make this faster is to allow it to be inlined. In that case, the compiler could interleave these operations with other work. Depending on the CPU, this could double the throughput here, as many ARM CPUs have 2 integer instruction pipelines. Spread out the instructions enough so that there are no hazards, and away it goes. This has to be weighed against I-cache usage, but in a case where it mattered, you could see something better.
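A minimal sketch of that suggestion, assuming the function is visible to its callers (e.g. defined in a header):

#include <stdint.h>

/* Being visible at the call site lets the compiler inline this and
   interleave the shift/or with surrounding independent work. */
static inline uint16_t swap_inline(uint16_t value)
{
    return (uint16_t)(value << 8 | value >> 8);
}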
There is a missed optimization here, but `and` isn't the missing piece. Generating a 16-bit constant isn't cheap. For a loop, yes, it would be a win to generate the constant outside the loop and use just `and` inside the loop. (TODO: call `swap` in a loop over an array and see what kind of code we get; a sketch of such a loop follows below.)
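A hedged sketch of that experiment (`swap_array` is a hypothetical name; this only sets up the loop for the compiler and makes no claim about what gcc actually emits for it):

#include <stdint.h>
#include <stddef.h>

/* Byte-swap each element in place; the interesting question is
   whether the compiler hoists a mask constant out of the loop. */
void swap_array(uint16_t *buf, size_t n)
{
    for (size_t i = 0; i < n; i++)
        buf[i] = (uint16_t)(buf[i] << 8 | buf[i] >> 8);
}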
For an out-of-order CPU, it could also be worth using multiple instructions off the critical path to build a constant, so you only have one `and` on the critical path instead of two shifts. But that's probably rare, and not what gcc chooses.
AFAICT (from looking at compiler output for simple functions), the ARM calling convention guarantees there's no high garbage in input registers, and doesn't allow leaving high garbage in return values. i.e. on input, it can assume that the upper 16 bits of `r0` are all zero, but must leave them zero on return. The `value << 8` left shift is thus a problem, but the `value >> 8` isn't (it doesn't have to worry about shifting garbage down into the low 16).
(Note that x86 calling conventions aren't like this: return values are allowed to have high garbage, maybe because the caller can simply use the 16-bit or 8-bit partial register. So are input values, except as an undocumented part of the x86-64 System V ABI: clang depends on input values being sign/zero-extended to 32 bits. GCC provides this when calling, but doesn't assume it as a callee.)
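To make the return-value half of that concrete, a hedged illustration (the caller `use_swap` is hypothetical):

#include <stdint.h>

uint16_t swap(const uint16_t value); /* the function from the question */

uint32_t use_swap(uint16_t v)
{
    /* At the asm level, a caller following this convention may use
       r0's full 32 bits directly, so swap() must not leave shift
       garbage in the upper 16 bits. */
    return (uint32_t)swap(v) + 1u;
}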
ARMv6 has a `rev16` instruction which byte-swaps the two 16-bit halves of a register. If the upper 16 bits are already zeroed, they don't need to be re-zeroed, so `gcc -march=armv6` should compile the function to just `rev16`. But in fact it emits a `uxth` to extract and zero-extend the low half-word (i.e. exactly the same thing as `and` with 0x0000FFFF, but without needing a large constant). I believe this is pure missed optimization; presumably gcc's rotate idiom, or its internal definition for using `rev16` that way, doesn't include enough info to let it realize the top half stays zeroed.
swap: @@ gcc6.3 -O3 -march=armv6 -marm
rev16 r0, r0
uxth r0, r0 @ not needed
bx lr
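Relatedly, GCC and clang also provide `__builtin_bswap16`, which expresses this operation directly; whether gcc 6.3 turns it into a bare `rev16` here (rather than also emitting the `uxth`) would need checking, so treat this as a sketch:

#include <stdint.h>

uint16_t swap_builtin(uint16_t value)
{
    /* 16-bit byte swap via the compiler builtin. */
    return __builtin_bswap16(value);
}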
For ARM pre v6, a shorter sequence is possible. GCC only finds it if we hand-hold it towards the asm we want:
// better on pre-v6, worse on ARMv6 (defeats rev16 optimization)
uint16_t swap_prev6(const uint16_t value)
{
uint32_t high = value;
high <<= 24; // knock off the high bits
high >>= 16; // and place the low8 where we want it
uint8_t low = value >> 8;
return high | low;
//return value << 8 | value >> 8;
}
swap_prev6: @ gcc6.3 -O3 -marm. (Or armv7 -mthumb for thumb2)
lsl r3, r0, #24
lsr r3, r3, #16
orr r0, r3, r0, lsr #8
bx lr
But this defeats gcc's rotate-idiom recognition, so it compiles to this same code even with `-march=armv6`, when the simple version compiles to `rev16` / `uxth`.