C Is this branchless hack actually faster?

Question

I'm trying to clamp a value between -127 and 127 on a Cortex-M based microcontroller.

I have two competing functions, one uses conditionals the other uses a branchless hack I found here.

// Using conditional statements
int clamp(int val) { return ((val > 127) ? 127 : (val < -127) ? -127 : val); }

// Using branchless hacks
int clamp(int val) {
    val -= -127;
    val &= (~val) >> 31;
    val += -127;
    val -= 127;
    val &= val >> 31;
    val += 127;

    return val;
}

Now I know in some cases one of these methods might be faster than the other, and vise-versa but in general is it worth it to use the branchless technique seeing as it doesn't really matter to me which I use, they both will work just fine in my case?

A little background on the microcontroller, it's a ARM based microcontroller running at 90 MIPS with a 3 stage pipeline, fetch, decode and execute and it seems to have some sort of branch predictor but I couldn't dig up details.

nneonneo · Accepted Answer

ARM code (GCC 4.6.3 with -O3):

clamp1:
    mvn r3, #126
    cmp r0, r3
    movlt   r0, r3
    cmp r0, #127
    movge   r0, #127
    bx  lr

clamp2:
    add r0, r0, #127
    mvn r3, r0
    and r0, r0, r3, asr #31
    sub r0, r0, #254
    and r0, r0, r0, asr #31
    add r0, r0, #127
    bx  lr

Thumb code:

clamp1:
    mvn r3, #126
    cmp r0, r3
    it  lt
    movlt   r0, r3
    cmp r0, #127
    it  ge
    movge   r0, #127
    bx  lr

clamp2:
    adds    r0, r0, #127
    mvns    r3, r0
    and r0, r0, r3, asr #31
    subs    r0, r0, #254
    and r0, r0, r0, asr #31
    adds    r0, r0, #127
    bx  lr

Both are branchless thanks to ARM's conditional execution design. I will bet you they are essentially comparable in performance.

enhzflep · Answer

Something to realize is the the ARM and x86 architectures are very different when it comes to branch instructions. Taking a jump clears the pipeline which can result in the expediture of a number of clock cycles just to 'get back to where you were' in terms of throughput.

To quote a pdf I downloaded the other day (pg14 of http://simplemachines.it/doc/arm_inst.pdf),

Conditional Execution

Most instruction sets only allow branches to be executed conditionally.
However by reusing the condition evaluation hardware, ARM effectively increases number of instructions.
All instructions contain a condition field which determines whether the CPU will execute them.
Non-executed instructions soak up 1 cycle. – Still have to complete cycle so as to allow fetching and decoding of following instructions.
This removes the need for many branches, which stall the pipeline (3 cycles to refill).
Allows very dense in-line code, without branches.
The Time penalty of not executing several conditional instructions is frequently less than overhead of the branch or subroutine call that would otherwise be needed.

C Is this branchless hack actually faster?

Tags:

c

Cody Smith

2 Answers

nneonneo

enhzflep

Recent Activity

Donate For Us

C Is this branchless hack actually faster?

Tags:

c

Cody Smith

2 Answers

nneonneo

enhzflep

Related questions

Recent Activity

Donate For Us