Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

ARM Assembly: Absolute Value Function: Are two or three lines faster?

In my embedded systems class, we were asked to re-code the given C-function AbsVal into ARM Assembly. We were told that the best we could do was 3-lines. I was determined to find a 2-line solution and eventually did, but the question I have now is whether I actually decreased performance or increased it.

The C-code:

unsigned long absval(signed long x){
    unsigned long int signext;
    signext = (x >= 0) ? 0 : -1; //This can be done with an ASR instruction
    return (x + signet) ^ signext;
}

The TA/Professor's 3-line solution

ASR R1, R0, #31         ; R1 <- (x >= 0) ? 0 : -1
ADD R0, R0, R1          ; R0 <- R0 + R1
EOR R0, R0, R1          ; R0 <- R0 ^ R1

My 2-line solution

ADD R1, R0, R0, ASR #31 ; R1 <- x  + (x >= 0) ? 0 : -1
EOR R0, R1, R0, ASR #31 ; R0 <- R1 ^ (x >= 0) ? 0 : -1

There are a couple of places I can see potential performance differences:

  1. The addition of one extra Arithmetic Shift Right call
  2. The removal of one memory fetch

So, which one is actually faster? Does it depend upon the processor or memory access speed?

like image 278
Ken W Avatar asked May 11 '13 16:05

Ken W


People also ask

What does BX LR do in assembly?

With these instructions, the return address will be stored in the link register (LR) and the function can be terminated using BX LR, which causes program control to return to the calling process.

What is BLT in ARM assembly?

Branch if less than (blt) The blt instruction compares 2 registers, treating them as signed integers, and takes a branch if one register is less than another.

What is R and M in assembly?

Rn: An operand in a register for an arithmetic operation. Rm: An operand in a register for an arithmetic operation. Ra: A value in a register to be used in an addition or subtraction. Think "accumulator"

What is offset in ARM?

The ARM instruction set architecture has three addressing modes: Immediate. The offset is an unsigned integer that is stored as part of the instruction. It can be added to or subtracted from the value in the base register.


2 Answers

Here is a nother two instruction version:

    cmp     r0, #0
    rsblt   r0, r0, #0

Which translate to the simple code:

  if (r0 < 0)
  {
    r0 = 0-r0;
  }

That code should be pretty fast, even on modern ARM-CPU cores like the Cortex-A8 and A9.

like image 78
Nils Pipenbrinck Avatar answered Oct 16 '22 17:10

Nils Pipenbrinck


Dive over to ARM.com and grab the Cortex-M3 datasheet. Section 3.3.1 on page 3-4 has the instruction timings. Fortunately they're quite straightforward on the Cortex-M3.

We can see from those timings that in a perfect 'no wait state' system your professor's example takes 3 cycles:

ASR R1, R0, #31         ; 1 cycle
ADD R0, R0, R1          ; 1 cycle
EOR R0, R0, R1          ; 1 cycle
                        ; total: 3 cycles

and your version takes two cycles:

ADD R1, R0, R0, ASR #31 ; 1 cycle
EOR R0, R1, R0, ASR #31 ; 1 cycle
                        ; total: 2 cycles

So yours is, theoretically, faster.

You mention "The removal of one memory fetch", but is that true? How big are the respective routines? Since we're dealing with Thumb-2 we have a mix of 16-bit and 32-bit instructions available. Let's see how they assemble:

Their version (adjusted for UAL syntax):

    .syntax unified
    .text
    .thumb
abs:
    asrs r1, r0, #31
    adds r0, r0, r1
    eors r0, r0, r1

Assembles to:

00000000        17c1    asrs    r1, r0, #31
00000002        1840    adds    r0, r0, r1
00000004        4048    eors    r0, r1

That's 3x2 = 6 bytes.

Your version (again, adjusted for UAL syntax):

    .syntax unified
    .text
    .thumb
abs:
    add.w r1, r0, r0, asr #31
    eor.w r0, r1, r0, asr #31

Assembles to:

00000000    eb0071e0    add.w   r1, r0, r0, asr #31
00000004    ea8170e0    eor.w   r0, r1, r0, asr #31

That's 2x4 = 8 bytes.

So instead of removing a memory fetch you've actually increased the size of the code.

But does this affect performance? My advice would be to benchmark.

like image 4
David Thomas Avatar answered Oct 16 '22 17:10

David Thomas