ARM Assembly: Absolute Value Function: Are two or three lines faster?

Tags:

performance

optimization

assembly

arm

cortex-m3

In my embedded systems class, we were asked to rewrite the given C function absval in ARM assembly. We were told that the best we could do was three lines. I was determined to find a two-line solution and eventually did, but my question now is whether I actually improved performance or made it worse.

The C-code:

unsigned long absval(signed long x){
    unsigned long int signext;
    signext = (x >= 0) ? 0 : -1; //This can be done with an ASR instruction
    return (x + signext) ^ signext;
}

The TA/Professor's 3-line solution

ASR R1, R0, #31         ; R1 <- (x >= 0) ? 0 : -1
ADD R0, R0, R1          ; R0 <- R0 + R1
EOR R0, R0, R1          ; R0 <- R0 ^ R1

My 2-line solution

ADD R1, R0, R0, ASR #31 ; R1 <- x  + ((x >= 0) ? 0 : -1)
EOR R0, R1, R0, ASR #31 ; R0 <- R1 ^ ((x >= 0) ? 0 : -1)

There are a couple of places I can see potential performance differences:

1. The addition of one extra Arithmetic Shift Right operation
2. The removal of one memory fetch

So, which one is actually faster? Does it depend upon the processor or memory access speed?


asked May 11 '13 16:05

Ken W

2 Answers

Here is another two-instruction version:

    cmp     r0, #0
    rsblt   r0, r0, #0

Which translates to this simple code:

  if (r0 < 0)
  {
    r0 = 0-r0;
  }

That code should be pretty fast, even on modern ARM-CPU cores like the Cortex-A8 and A9.
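For comparison with the C in the question, here is a rough C rendering of the cmp/rsblt pair. It is only a sketch: abs_cond is a made-up name, and the negation is done in unsigned arithmetic so that the most negative input wraps to 0x80000000 rather than overflowing a signed type.

```c
#include <stdint.h>

/* Conditional negate, mirroring cmp/rsblt: if (r0 < 0) r0 = 0 - r0.
 * abs_cond is a hypothetical name used only for this illustration. */
uint32_t abs_cond(int32_t x)
{
    uint32_t r0 = (uint32_t)x;      /* model the register as unsigned */
    if (x < 0) {                    /* cmp   r0, #0      */
        r0 = (uint32_t)(0u - r0);   /* rsblt r0, r0, #0  */
    }
    return r0;
}
```

Calling abs_cond(-5) gives 5, and abs_cond(INT32_MIN) gives 0x80000000, the same results the sign-mask versions produce.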

answered Oct 16 '22 17:10

Nils Pipenbrinck

Head over to ARM.com and grab the Cortex-M3 Technical Reference Manual. Section 3.3.1 on page 3-4 has the instruction timings. Fortunately they're quite straightforward on the Cortex-M3.

We can see from those timings that in a perfect 'no wait state' system your professor's example takes 3 cycles:

ASR R1, R0, #31         ; 1 cycle
ADD R0, R0, R1          ; 1 cycle
EOR R0, R0, R1          ; 1 cycle
                        ; total: 3 cycles

and your version takes two cycles:

ADD R1, R0, R0, ASR #31 ; 1 cycle
EOR R0, R1, R0, ASR #31 ; 1 cycle
                        ; total: 2 cycles

So yours is, theoretically, faster.
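If you want to convince yourself that folding the shift into the operands does not change the result, the two sequences can be modelled in C. This is only a sketch: asr31, abs_three_step and abs_two_step are names made up for illustration, and the registers are modelled as uint32_t so that the arithmetic wraps the way it does in a 32-bit register.

```c
#include <stdint.h>

/* Models "ASR #31" on a 32-bit register: all ones if the sign bit is
 * set, all zeros otherwise. Written without >> on a signed value, whose
 * result for negative operands is implementation-defined in C. */
static uint32_t asr31(uint32_t r)
{
    return (r & 0x80000000u) ? 0xFFFFFFFFu : 0u;
}

/* Three-instruction sequence: ASR, ADD, EOR. */
uint32_t abs_three_step(int32_t x)
{
    uint32_t r0 = (uint32_t)x;
    uint32_t r1 = asr31(r0);        /* ASR R1, R0, #31 */
    r0 = r0 + r1;                   /* ADD R0, R0, R1  */
    r0 = r0 ^ r1;                   /* EOR R0, R0, R1  */
    return r0;
}

/* Two-instruction sequence: the shift is folded into each operand. */
uint32_t abs_two_step(int32_t x)
{
    uint32_t r0 = (uint32_t)x;
    uint32_t r1 = r0 + asr31(r0);   /* ADD R1, R0, R0, ASR #31 */
    r0 = r1 ^ asr31(r0);            /* EOR R0, R1, R0, ASR #31 */
    return r0;
}
```

Both functions return the same value for every input, because the second one just recomputes the sign mask instead of keeping it in R1.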

You mention "The removal of one memory fetch", but is that true? How big are the respective routines? Since we're dealing with Thumb-2 we have a mix of 16-bit and 32-bit instructions available. Let's see how they assemble:

Their version (adjusted for UAL syntax):

    .syntax unified
    .text
    .thumb
abs:
    asrs r1, r0, #31
    adds r0, r0, r1
    eors r0, r0, r1

Assembles to:

00000000        17c1    asrs    r1, r0, #31
00000002        1840    adds    r0, r0, r1
00000004        4048    eors    r0, r1

That's 3x2 = 6 bytes.

Your version (again, adjusted for UAL syntax):

    .syntax unified
    .text
    .thumb
abs:
    add.w r1, r0, r0, asr #31
    eor.w r0, r1, r0, asr #31

Assembles to:

00000000    eb0071e0    add.w   r1, r0, r0, asr #31
00000004    ea8170e0    eor.w   r0, r1, r0, asr #31

That's 2x4 = 8 bytes.

So instead of removing a memory fetch you've actually increased the size of the code.
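Byte counts alone do not tell you how many fetches actually happen, though. As a rough sketch, assuming instructions are fetched as aligned 32-bit words (an assumption to check against the manual for your core; prefetch buffers and flash wait states complicate the picture), the helper below counts how many words a routine touches. words_touched is a hypothetical name, not part of any ARM library.

```c
#include <stdint.h>

/* Number of aligned 32-bit words touched by a routine of `size` bytes
 * starting at byte address `start`. Assumes a 32-bit wide instruction
 * fetch; real fetch behaviour depends on the core and memory system. */
uint32_t words_touched(uint32_t start, uint32_t size)
{
    if (size == 0u) {
        return 0u;
    }
    uint32_t first = start / 4u;                /* word holding the first byte */
    uint32_t last  = (start + size - 1u) / 4u;  /* word holding the last byte  */
    return last - first + 1u;
}
```

Under these assumptions the 6-byte and 8-byte routines both touch two words when they start on a word boundary, while starting at a halfword offset (address 2, say) makes the 8-byte version touch three.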

But does this affect performance? My advice would be to benchmark.
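One possible shape for such a benchmark is sketched below. Everything here is illustrative: measure_cycles, cycle_reader, abs_ref and fake_cycle_reader are made-up names. On a real Cortex-M3 you would replace the fake reader with something that returns a free-running cycle count, for example the DWT cycle counter if your part implements it, and swap abs_ref for the routine under test.

```c
#include <stdint.h>

typedef uint32_t (*cycle_reader)(void);  /* hypothetical hook: free-running cycle count */
typedef uint32_t (*abs_fn)(int32_t);     /* routine under test */

/* Elapsed count for `iterations` calls of `fn`, as reported by the hook.
 * Unsigned subtraction stays correct across a single counter wrap. */
uint32_t measure_cycles(abs_fn fn, cycle_reader read_cycles, uint32_t iterations)
{
    volatile uint32_t sink = 0u;         /* keeps the calls from being optimized away */
    uint32_t start = read_cycles();
    for (uint32_t i = 0u; i < iterations; i++) {
        sink += fn(-(int32_t)(i & 0xFFu));   /* inputs 0, -1, ..., -255 */
    }
    uint32_t end = read_cycles();
    (void)sink;
    return end - start;
}

/* Stand-ins so the sketch builds on a host machine. */
static uint32_t fake_now = 0u;
uint32_t fake_cycle_reader(void) { fake_now += 3u; return fake_now; }

uint32_t abs_ref(int32_t x)
{
    uint32_t u = (uint32_t)x;
    return (x < 0) ? (uint32_t)(0u - u) : u;
}
```

The number you get includes loop and call overhead, so it is worth timing an empty routine the same way and subtracting, and running the routine from the memory (flash or RAM) you actually intend to use.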

answered Oct 16 '22 17:10

David Thomas
