Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

ARM Cortex M0+: How to use "Branch if Carry" instructions in C-code?

I have some C code that processes data bit-by-bit. Simplified example:

// input data, assume this is initialized
uint32_t data[len];

for (uint32_t idx=0; idx<len; idx++)
{
    uint32_t tmp = data[idx];
    
    // iterate over all bits
    for (uint8_t pos=0; pos<32; pos++)
    {
        if (tmp & 0b1)
        {
            // some code
        }
    
        tmp = tmp >> 1;
    }
}

In my application len is relatively large, so I'd like to optimize the inner loop as highly as possible. The // some code section is small and already heavily optimized.

I'm using an ARM Cortex M0+ MCU which has an instruction to branch if the carry bit is set (see cortex-m0+ manual, page 45). Conviniently shifting bits places the LSB (or MSB) into the carry flag, so in theory it can branch without the comparison like this:

// input data, assume this is initialized
uint32_t data[len];

for (uint32_t idx=0; idx<len; idx++)
{
    uint32_t tmp = data[idx];

    // iterate over all bits
    for (uint8_t pos=0; pos<32; pos++)
    {
        tmp = tmp >> 1;

        if ( CARRY_SET )
        {
            // some code
        }
    }
}

What is the best way to archive this with C code and/or inline Assembler? Ideally I'd like to keep the // come code in C for simplicity and better readability.


Edit 1: I've tested this code on GCC 5.4 GCC 6.3 with -O1, -O2 and -03. For every setting it generates the following assembly code (note the dedicated tst instruction I try to get rig of):

        if (data & 0b1)             
00000218   movs r3, #1       
0000021A   tst  r3, r6       
0000021C   beq  #4

Edit 2: minimal reproducable example. I'm writing the code in Atmel Studio 7 (because it's intended for a MCU) and inspect the values int the build-in debugger. If you use a different enviroment, you may need to add som IO code:

int main(void)
{
    uint32_t tmp = 0x12345678;
    volatile uint8_t bits = 0;  // volatile needed in this example to prevent compiler from optimizing away all code.

    // iterate over all bits
    for (uint8_t pos=0; pos<32; pos++)
    {
        if (tmp & 1)
        {
            bits++;    // the real code isn't popcount.  Some compilers may transform this example loop into a different popcount algorithm if bits wasn't volatile.
        }
        tmp = tmp >> 1;
    }

    // read bits here with debugger
    while(1);
}
like image 339
nqtronix Avatar asked Mar 01 '23 12:03

nqtronix


1 Answers

I did not find an "easy" solution, so I had to write my short algorithm in assembler. This is what the demo code looks like:

// assume these values as initialized
uint32_t data[len];     // input data bit stream
uint32_t out;           // algorithm input + output
uint32_t in;            // algorithm input (value never written in asm)

for (uint32_t idx=0; idx<len; idx++)
{
    uint32_t tmp = data[idx];
    
    // iterate over all bits
    for (uint8_t pos=0; pos<32; pos++)
    {
        // use optimized code only on supported devices
        #if defined(__CORTEX_M) && (__CORTEX_M <= 4)
        asm volatile  // doesn't need to be volatile if you use the result
        (
            "LSR    %[tmp], %[tmp], #1" "\n\t"  // shift data by one. LSB is now in carry
            "BCC    END_%="             "\n\t"  // branch if carry clear (LSB was not set)
            
            /* your code here */        "\n\t"
        
            "END_%=:"                   "\n\t"  // label only, doesn't generate any instructions
        
            : [tmp]"+l"(tmp), [out]"+l"(out)    // out; l = register 0..7 = general purpose registers
            : [in]"l"(in)                       // in;
            : "cc"                              // clobbers: "cc" = CPU status flags have changed
                               // Add any other registers you use as temporaries, or use dummy output operands to let the compiler pick registers.
        );
        #else
        if (tmp & 0b1)
        {
            // some code
        }
        tmp = tmp >> 1;
        #endif
    }
}

For your application, add your assembly code at the marked location, and feed in data from the C function with the registers. Keep in mind that in Thumb mode, many instructions can only use 8 of the 16 general purpose registers, so you can't pass more values than that.

Inline assembly is very easy to get wrong in subtle ways that appear to work but can break after inlining into different surrounding code. (For example, forgetting to declare a clobber.) https://gcc.gnu.org/wiki/DontUseInlineAsm unless you need to (including for performance), but if so make sure you check the docs (https://stackoverflow.com/tags/inline-assembly/info).

Note that technically the correct shift instruction is LSRS (with an s suffix to set flags). However on GCC 6.3 + GAS writing lsrs in the asm code will cause an error assembling in thumb mode, but if you write lsr it assembles successfully into an lsrs instruction. (In ARM mode, which Cortex-M doesn't support, lsr and lsrs both assemble to separate instructions as expected.)


Although I can not share my application code, I can tell you how much speedup this change had:

-O1 -O2 -O3
original 812us 780us 780us
w/ asm 748us 686us 716us
w/ asm + some loop unrolling 732us 606us 648us

So with my the ASM code and -O2 instead of -O1 I get a speedup of 15% and with additional loop unrolling I got a speedup of 25%.

Placing the function in RAM with __attribute__ ((section(".ramfunc"))) yields another 1% improvement. (Make sure to test this on your device, some MCUs have awful flash cache miss penalties.)

See old_timer's answer below for more general purpose optimizations.

like image 117
nqtronix Avatar answered Mar 04 '23 02:03

nqtronix