I have some C code that processes data bit-by-bit. Simplified example:
// input data, assume this is initialized
uint32_t data[len];
for (uint32_t idx=0; idx<len; idx++)
{
uint32_t tmp = data[idx];
// iterate over all bits
for (uint8_t pos=0; pos<32; pos++)
{
if (tmp & 0b1)
{
// some code
}
tmp = tmp >> 1;
}
}
In my application len
is relatively large, so I'd like to optimize the inner loop as highly as possible. The // some code
section is small and already heavily optimized.
I'm using an ARM Cortex M0+ MCU which has an instruction to branch if the carry bit is set (see cortex-m0+ manual, page 45). Conviniently shifting bits places the LSB (or MSB) into the carry flag, so in theory it can branch without the comparison like this:
// input data, assume this is initialized
uint32_t data[len];
for (uint32_t idx=0; idx<len; idx++)
{
uint32_t tmp = data[idx];
// iterate over all bits
for (uint8_t pos=0; pos<32; pos++)
{
tmp = tmp >> 1;
if ( CARRY_SET )
{
// some code
}
}
}
What is the best way to archive this with C code and/or inline Assembler? Ideally I'd like to keep the // come code
in C for simplicity and better readability.
Edit 1: I've tested this code on GCC 5.4 GCC 6.3 with -O1, -O2 and -03. For every setting it generates the following assembly code (note the dedicated tst
instruction I try to get rig of):
if (data & 0b1)
00000218 movs r3, #1
0000021A tst r3, r6
0000021C beq #4
Edit 2: minimal reproducable example. I'm writing the code in Atmel Studio 7 (because it's intended for a MCU) and inspect the values int the build-in debugger. If you use a different enviroment, you may need to add som IO code:
int main(void)
{
uint32_t tmp = 0x12345678;
volatile uint8_t bits = 0; // volatile needed in this example to prevent compiler from optimizing away all code.
// iterate over all bits
for (uint8_t pos=0; pos<32; pos++)
{
if (tmp & 1)
{
bits++; // the real code isn't popcount. Some compilers may transform this example loop into a different popcount algorithm if bits wasn't volatile.
}
tmp = tmp >> 1;
}
// read bits here with debugger
while(1);
}
I did not find an "easy" solution, so I had to write my short algorithm in assembler. This is what the demo code looks like:
// assume these values as initialized
uint32_t data[len]; // input data bit stream
uint32_t out; // algorithm input + output
uint32_t in; // algorithm input (value never written in asm)
for (uint32_t idx=0; idx<len; idx++)
{
uint32_t tmp = data[idx];
// iterate over all bits
for (uint8_t pos=0; pos<32; pos++)
{
// use optimized code only on supported devices
#if defined(__CORTEX_M) && (__CORTEX_M <= 4)
asm volatile // doesn't need to be volatile if you use the result
(
"LSR %[tmp], %[tmp], #1" "\n\t" // shift data by one. LSB is now in carry
"BCC END_%=" "\n\t" // branch if carry clear (LSB was not set)
/* your code here */ "\n\t"
"END_%=:" "\n\t" // label only, doesn't generate any instructions
: [tmp]"+l"(tmp), [out]"+l"(out) // out; l = register 0..7 = general purpose registers
: [in]"l"(in) // in;
: "cc" // clobbers: "cc" = CPU status flags have changed
// Add any other registers you use as temporaries, or use dummy output operands to let the compiler pick registers.
);
#else
if (tmp & 0b1)
{
// some code
}
tmp = tmp >> 1;
#endif
}
}
For your application, add your assembly code at the marked location, and feed in data from the C function with the registers. Keep in mind that in Thumb mode, many instructions can only use 8 of the 16 general purpose registers, so you can't pass more values than that.
Inline assembly is very easy to get wrong in subtle ways that appear to work but can break after inlining into different surrounding code. (For example, forgetting to declare a clobber.) https://gcc.gnu.org/wiki/DontUseInlineAsm unless you need to (including for performance), but if so make sure you check the docs (https://stackoverflow.com/tags/inline-assembly/info).
Note that technically the correct shift instruction is LSRS
(with an s
suffix to set flags). However on GCC 6.3 + GAS writing lsrs
in the asm code will cause an error assembling in thumb mode, but if you write lsr
it assembles successfully into an lsrs
instruction. (In ARM mode, which Cortex-M doesn't support, lsr
and lsrs
both assemble to separate instructions as expected.)
Although I can not share my application code, I can tell you how much speedup this change had:
-O1 | -O2 | -O3 | |
---|---|---|---|
original | 812us | 780us | 780us |
w/ asm | 748us | 686us | 716us |
w/ asm + some loop unrolling | 732us | 606us | 648us |
So with my the ASM code and -O2 instead of -O1 I get a speedup of 15% and with additional loop unrolling I got a speedup of 25%.
Placing the function in RAM with __attribute__ ((section(".ramfunc")))
yields another 1% improvement. (Make sure to test this on your device, some MCUs have awful flash cache miss penalties.)
See old_timer's answer below for more general purpose optimizations.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With