I'm having a hard time beating my compiler using inline assembly. What's a good, non-contrived example of a function that the compiler has a hard time making really fast, but that is relatively simple to write with inline assembly?

What's an example of a simple C function which is faster implemented in inline assembly?

2 Answers

If you don't consider SIMD operations cheating, you can usually write SIMD assembly that performs much better than your compiler's autovectorization (if it even has autovectorization).

Here's a very basic SSE (one of x86's SIMD instruction sets) tutorial. It's for Visual C++ inline assembly.

Edit: Here's a small pair of functions if you want to try it yourself. They compute an n-length dot product. One uses SSE2 instructions inline (GCC inline syntax); the other is very basic C.

It's very simple, and I'd be surprised if a good compiler couldn't vectorize the simple C loop, but if it doesn't, you should see a speedup in the SSE2 version. The SSE2 version could probably be faster if I used more registers, but I don't want to stretch my very weak SSE skills :).

float dot_asm(float *a, float *b, int n)
{
  float ans = 0;
  int i;
  // Assumes n is a multiple of 8; leftover elements are not handled.
  while (n > 0) {
    float tmp[4] __attribute__ ((aligned(16)));

    // Everything is in one asm statement: register contents are not
    // guaranteed to survive between separate asm statements.
    __asm__ __volatile__(
            "movups     (%1), %%xmm1\n\t"      // a[0..3]
            "movups     16(%1), %%xmm2\n\t"    // a[4..7]
            "movups     (%2), %%xmm3\n\t"      // b[0..3]
            "movups     16(%2), %%xmm4\n\t"    // b[4..7]
            "add        $32, %1\n\t"
            "add        $32, %2\n\t"
            "mulps      %%xmm3, %%xmm1\n\t"
            "mulps      %%xmm4, %%xmm2\n\t"
            "addps      %%xmm2, %%xmm1\n\t"
            "movaps     %%xmm1, %0"            // store the 4 partial sums
            : "=m" (tmp), "+r" (a), "+r" (b)
            :
            : "xmm1", "xmm2", "xmm3", "xmm4", "memory");

    // Horizontal sum of the 4 partial sums.
    for (i = 0; i < 4; i++) {
      ans += tmp[i];
    }
    n -= 8;
  }
  return ans;
}

float dot_c(float *a, float *b, int n) {

  float ans = 0;
  int i;
  for(i = 0;i < n; i++) {
    ans += a[i]*b[i];
  }
  return ans;
}
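For comparison, here is a minimal sketch of the same dot product written with SSE intrinsics instead of inline asm. This is not from the answer above: the name dot_intrin is made up for illustration, and the code assumes a compiler that ships <xmmintrin.h> and defines __SSE__ (GCC and Clang do when targeting x86 with SSE enabled). On other targets it falls back to the plain scalar loop. Unlike dot_asm, it also handles lengths that are not a multiple of the vector width.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Hypothetical helper for illustration only. */
static float dot_intrin(const float *a, const float *b, size_t n)
{
    float ans = 0.0f;
    size_t i = 0;
#if defined(__SSE__)
    /* Four running partial sums, one per vector lane. */
    __m128 acc = _mm_setzero_ps();
    float lanes[4];
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   /* unaligned loads, like movups */
        __m128 vb = _mm_loadu_ps(b + i);
        acc = _mm_add_ps(acc, _mm_mul_ps(va, vb));
    }
    /* Horizontal sum of the four lanes. */
    _mm_storeu_ps(lanes, acc);
    ans = lanes[0] + lanes[1] + lanes[2] + lanes[3];
#endif
    /* Scalar tail (or the whole array when SSE is unavailable). */
    for (; i < n; i++) {
        ans += a[i] * b[i];
    }
    return ans;
}
```

Intrinsics leave register allocation and scheduling to the compiler, which is often a reasonable middle ground before dropping down to hand-written asm.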

answered Sep 27 '22 16:09

Falaina

Since this is related to the iPhone and assembly code, I'll give an example that is relevant in the iPhone world (and not SSE or x86 asm). If anybody decides to write assembly code for a real-world app, it will most likely be some sort of digital signal processing or image manipulation. Examples: converting the colorspace of RGB pixels, encoding images to JPEG/PNG, or encoding sound to MP3, AMR or G.729 for VoIP applications. In sound encoding there are many routines that the compiler cannot translate to efficient asm code because they simply have no equivalent in C. Commonly used examples in sound processing: saturated math, multiply-accumulate routines, matrix multiplication.

Example of saturated add: a 32-bit signed int has the range 0x80000000 (INT_MIN) to 0x7fffffff (INT_MAX). If you add two ints the result could overflow, which can be unacceptable in digital signal processing. Basically, if the result overflows or underflows, a saturated add should return 0x7fffffff or 0x80000000 respectively. Checking for that takes a full C function. An optimized version of saturated add could be:

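int saturated_add(int a, int b)
{
    /* Add as unsigned: signed overflow is undefined behavior in C.
       Assumes a 32-bit int. */
    unsigned int sum = (unsigned int)a + (unsigned int)b;

    /* Overflow is only possible when a and b have the same sign... */
    if ((((unsigned int)a ^ (unsigned int)b) & 0x80000000u) == 0)
    {
        /* ...and it happened if the sign of the sum differs from theirs. */
        if ((sum ^ (unsigned int)a) & 0x80000000u)
        {
            return (a < 0) ? (-0x7fffffff - 1) : 0x7fffffff;
        }
    }
    return (int)sum;
}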

You could also use multiple if/else checks to detect overflow, or on x86 you could check the overflow flag (which also requires asm). The iPhone uses an ARMv6 or ARMv7 CPU, which has DSP instructions. So the saturated_add function, with multiple branches (if/else statements) and two 32-bit constants, can become one simple asm instruction that takes only one CPU cycle. Simply making saturated_add use that asm instruction could make the entire algorithm two to three times faster (and smaller in size). Here's the QADD manual: QADD

Other examples of code that is often executed in long loops:
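As a rough sketch of what using QADD from C might look like with GCC-style inline asm: the function name sat_add32 is made up for illustration, and the guard assumes the compiler defines __arm__ and the ACLE macro __ARM_FEATURE_DSP on 32-bit ARM targets with the DSP instructions. Everywhere else the sketch falls back to a portable version that widens to 64 bits and clamps.

```c
#include <stdint.h>

/* Hypothetical wrapper for illustration only. */
static int32_t sat_add32(int32_t a, int32_t b)
{
#if defined(__arm__) && defined(__ARM_FEATURE_DSP)
    int32_t r;
    /* QADD Rd, Rm, Rn: signed 32-bit add that saturates instead of wrapping. */
    __asm__("qadd %0, %1, %2" : "=r"(r) : "r"(a), "r"(b));
    return r;
#else
    /* Portable fallback: do the add in 64 bits, then clamp to int32 range. */
    int64_t wide = (int64_t)a + (int64_t)b;
    if (wide > INT32_MAX) return INT32_MAX;
    if (wide < INT32_MIN) return INT32_MIN;
    return (int32_t)wide;
#endif
}
```

Keeping the asm inside a tiny wrapper like this means only the one hot operation is hand-written, and the rest of the algorithm stays in C.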

res1 = a + b1*c1;
res2 = a + b2*c2;
res3 = a + b3*c3;

It seems like nothing can be optimized here, but on an ARM CPU you can use specific DSP instructions that take fewer cycles than a simple multiplication! That's right, a + b*c with specific instructions could execute faster than a simple a*b. In cases like this, compilers simply cannot understand the logic of your code and can't use these DSP instructions directly, which is why you need to write asm manually to optimize the code. BUT you should only hand-write the parts of the code that really need to be optimized. If you start writing simple loops manually, you almost certainly won't beat the compiler! There are multiple good papers on the web about using inline assembly to code FIR filters, AMR encoding/decoding, etc.
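To make the multiply-accumulate idea concrete, here is a minimal sketch built around one such DSP instruction, SMLABB (16-bit x 16-bit multiply plus 32-bit accumulate). The names mac16 and fir_dot are made up for illustration, the choice of SMLABB is my assumption about which instruction fits, and the guard assumes __arm__ and __ARM_FEATURE_DSP are defined on suitable 32-bit ARM targets. On other targets a plain C fallback with the same wrapping behavior is used.

```c
#include <stdint.h>

/* Hypothetical helper: returns acc + x*y using one instruction where available. */
static int32_t mac16(int32_t acc, int16_t x, int16_t y)
{
#if defined(__arm__) && defined(__ARM_FEATURE_DSP)
    int32_t r;
    /* SMLABB Rd, Rn, Rm, Ra: Rd = Ra + (low 16 bits of Rn) * (low 16 bits of Rm) */
    __asm__("smlabb %0, %1, %2, %3"
            : "=r"(r)
            : "r"((int32_t)x), "r"((int32_t)y), "r"(acc));
    return r;
#else
    /* Fallback: accumulate in unsigned so wraparound is well defined. */
    return (int32_t)((uint32_t)acc + (uint32_t)((int32_t)x * (int32_t)y));
#endif
}

/* Hypothetical FIR-style inner loop built on mac16. */
static int32_t fir_dot(const int16_t *x, const int16_t *h, int n)
{
    int32_t acc = 0;
    int i;
    for (i = 0; i < n; i++) {
        acc = mac16(acc, x[i], h[i]);
    }
    return acc;
}
```

Only the inner operation is written in asm here; the loop itself is left to the compiler, in line with the advice above about not hand-writing simple loops.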

answered Sep 27 '22 16:09

pps


Tags:

assembly

inline-assembly

Hans Sjunnesson
