Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

When is assembly faster than C? [closed]

People also ask

Is C or assembly more efficient?

The reason C is faster than assembly is because the only way to write optimal code is to measure it on a real machine, and with C you can run many more experiments, much faster.

Is C the closest to assembly?

C is not closer to the hardware compared to assembler, on the contrary it is far from the hardware. So your question doesn't make any sense. Even if it did, it is far too broad for Stack Overflow. One could easily write a book about all the things in C that makes it closer to the hardware than for example Java.

Is assembly more efficient than C++?

Is Assembly faster than C++? The unsatisfying answer: Nearly every C++ compiler can output assembly language,* so assembly language can be exactly the same speed as C++ if you use C++ to develop the assembly code.

What is the difference between assembly and C?

Programs written in assembly can execute faster, while programs written in C are easier to develop and maintain. In traditional applications, such as programs run on personal computers and mainframes, C is almost always the first choice.


Here is a real world example: Fixed point multiplies on old compilers.

These don't only come handy on devices without floating point, they shine when it comes to precision as they give you 32 bits of precision with a predictable error (float only has 23 bit and it's harder to predict precision loss). i.e. uniform absolute precision over the entire range, instead of close-to-uniform relative precision (float).


Modern compilers optimize this fixed-point example nicely, so for more modern examples that still need compiler-specific code, see

  • Getting the high part of 64 bit integer multiplication: A portable version using uint64_t for 32x32 => 64-bit multiplies fails to optimize on a 64-bit CPU, so you need intrinsics or __int128 for efficient code on 64-bit systems.
  • _umul128 on Windows 32 bits: MSVC doesn't always do a good job when multiplying 32-bit integers cast to 64, so intrinsics helped a lot.

C doesn't have a full-multiplication operator (2N-bit result from N-bit inputs). The usual way to express it in C is to cast the inputs to the wider type and hope the compiler recognizes that the upper bits of the inputs aren't interesting:

// on a 32-bit machine, int can hold 32-bit fixed-point integers.
int inline FixedPointMul (int a, int b)
{
  long long a_long = a; // cast to 64 bit.

  long long product = a_long * b; // perform multiplication

  return (int) (product >> 16);  // shift by the fixed point bias
}

The problem with this code is that we do something that can't be directly expressed in the C-language. We want to multiply two 32 bit numbers and get a 64 bit result of which we return the middle 32 bit. However, in C this multiply does not exist. All you can do is to promote the integers to 64 bit and do a 64*64 = 64 multiply.

x86 (and ARM, MIPS and others) can however do the multiply in a single instruction. Some compilers used to ignore this fact and generate code that calls a runtime library function to do the multiply. The shift by 16 is also often done by a library routine (also the x86 can do such shifts).

So we're left with one or two library calls just for a multiply. This has serious consequences. Not only is the shift slower, registers must be preserved across the function calls and it does not help inlining and code-unrolling either.

If you rewrite the same code in (inline) assembler you can gain a significant speed boost.

In addition to this: using ASM is not the best way to solve the problem. Most compilers allow you to use some assembler instructions in intrinsic form if you can't express them in C. The VS.NET2008 compiler for example exposes the 32*32=64 bit mul as __emul and the 64 bit shift as __ll_rshift.

Using intrinsics you can rewrite the function in a way that the C-compiler has a chance to understand what's going on. This allows the code to be inlined, register allocated, common subexpression elimination and constant propagation can be done as well. You'll get a huge performance improvement over the hand-written assembler code that way.

For reference: The end-result for the fixed-point mul for the VS.NET compiler is:

int inline FixedPointMul (int a, int b)
{
    return (int) __ll_rshift(__emul(a,b),16);
}

The performance difference of fixed point divides is even bigger. I had improvements up to factor 10 for division heavy fixed point code by writing a couple of asm-lines.


Using Visual C++ 2013 gives the same assembly code for both ways.

gcc4.1 from 2007 also optimizes the pure C version nicely. (The Godbolt compiler explorer doesn't have any earlier versions of gcc installed, but presumably even older GCC versions could do this without intrinsics.)

See source + asm for x86 (32-bit) and ARM on the Godbolt compiler explorer. (Unfortunately it doesn't have any compilers old enough to produce bad code from the simple pure C version.)


Modern CPUs can do things C doesn't have operators for at all, like popcnt or bit-scan to find the first or last set bit. (POSIX has a ffs() function, but its semantics don't match x86 bsf / bsr. See https://en.wikipedia.org/wiki/Find_first_set).

Some compilers can sometimes recognize a loop that counts the number of set bits in an integer and compile it to a popcnt instruction (if enabled at compile time), but it's much more reliable to use __builtin_popcnt in GNU C, or on x86 if you're only targeting hardware with SSE4.2: _mm_popcnt_u32 from <immintrin.h>.

Or in C++, assign to a std::bitset<32> and use .count(). (This is a case where the language has found a way to portably expose an optimized implementation of popcount through the standard library, in a way that will always compile to something correct, and can take advantage of whatever the target supports.) See also https://en.wikipedia.org/wiki/Hamming_weight#Language_support.

Similarly, ntohl can compile to bswap (x86 32-bit byte swap for endian conversion) on some C implementations that have it.


Another major area for intrinsics or hand-written asm is manual vectorization with SIMD instructions. Compilers are not bad with simple loops like dst[i] += src[i] * 10.0;, but often do badly or don't auto-vectorize at all when things get more complicated. For example, you're unlikely to get anything like How to implement atoi using SIMD? generated automatically by the compiler from scalar code.


Many years ago I was teaching someone to program in C. The exercise was to rotate a graphic through 90 degrees. He came back with a solution that took several minutes to complete, mainly because he was using multiplies and divides etc.

I showed him how to recast the problem using bit shifts, and the time to process came down to about 30 seconds on the non-optimizing compiler he had.

I had just got an optimizing compiler and the same code rotated the graphic in < 5 seconds. I looked at the assembly code that the compiler was generating, and from what I saw decided there and then that my days of writing assembler were over.


Pretty much anytime the compiler sees floating point code, a hand written version will be quicker if you're using an old bad compiler. (2019 update: This is not true in general for modern compilers. Especially when compiling for anything other than x87; compilers have an easier time with SSE2 or AVX for scalar math, or any non-x86 with a flat FP register set, unlike x87's register stack.)

The primary reason is that the compiler can't perform any robust optimisations. See this article from MSDN for a discussion on the subject. Here's an example where the assembly version is twice the speed as the C version (compiled with VS2K5):

#include "stdafx.h"
#include <windows.h>

float KahanSum(const float *data, int n)
{
   float sum = 0.0f, C = 0.0f, Y, T;

   for (int i = 0 ; i < n ; ++i) {
      Y = *data++ - C;
      T = sum + Y;
      C = T - sum - Y;
      sum = T;
   }

   return sum;
}

float AsmSum(const float *data, int n)
{
  float result = 0.0f;

  _asm
  {
    mov esi,data
    mov ecx,n
    fldz
    fldz
l1:
    fsubr [esi]
    add esi,4
    fld st(0)
    fadd st(0),st(2)
    fld st(0)
    fsub st(0),st(3)
    fsub st(0),st(2)
    fstp st(2)
    fstp st(2)
    loop l1
    fstp result
    fstp result
  }

  return result;
}

int main (int, char **)
{
  int count = 1000000;

  float *source = new float [count];

  for (int i = 0 ; i < count ; ++i) {
    source [i] = static_cast <float> (rand ()) / static_cast <float> (RAND_MAX);
  }

  LARGE_INTEGER start, mid, end;

  float sum1 = 0.0f, sum2 = 0.0f;

  QueryPerformanceCounter (&start);

  sum1 = KahanSum (source, count);

  QueryPerformanceCounter (&mid);

  sum2 = AsmSum (source, count);

  QueryPerformanceCounter (&end);

  cout << "  C code: " << sum1 << " in " << (mid.QuadPart - start.QuadPart) << endl;
  cout << "asm code: " << sum2 << " in " << (end.QuadPart - mid.QuadPart) << endl;

  return 0;
}

And some numbers from my PC running a default release build*:

  C code: 500137 in 103884668
asm code: 500137 in 52129147

Out of interest, I swapped the loop with a dec/jnz and it made no difference to the timings - sometimes quicker, sometimes slower. I guess the memory limited aspect dwarfs other optimisations. (Editor's note: more likely the FP latency bottleneck is enough to hide the extra cost of loop. Doing two Kahan summations in parallel for the odd/even elements, and adding those at the end, could maybe speed this up by a factor of 2.)

Whoops, I was running a slightly different version of the code and it outputted the numbers the wrong way round (i.e. C was faster!). Fixed and updated the results.


Without giving any specific example or profiler evidence, you can write better assembler than the compiler when you know more than the compiler.

In the general case, a modern C compiler knows much more about how to optimize the code in question: it knows how the processor pipeline works, it can try to reorder instructions quicker than a human can, and so on - it's basically the same as a computer being as good as or better than the best human player for boardgames, etc. simply because it can make searches within the problem space faster than most humans. Although you theoretically can perform as well as the computer in a specific case, you certainly can't do it at the same speed, making it infeasible for more than a few cases (i.e. the compiler will most certainly outperform you if you try to write more than a few routines in assembler).

On the other hand, there are cases where the compiler does not have as much information - I'd say primarily when working with different forms of external hardware, of which the compiler has no knowledge. The primary example probably being device drivers, where assembler combined with a human's intimate knowledge of the hardware in question can yield better results than a C compiler could do.

Others have mentioned special purpose instructions, which is what I'm talking in the paragraph above - instructions of which the compiler might have limited or no knowledge at all, making it possible for a human to write faster code.


In my job, there are three reasons for me to know and use assembly. In order of importance:

  1. Debugging - I often get library code that has bugs or incomplete documentation. I figure out what it's doing by stepping in at the assembly level. I have to do this about once a week. I also use it as a tool to debug problems in which my eyes don't spot the idiomatic error in C/C++/C#. Looking at the assembly gets past that.

  2. Optimizing - the compiler does fairly well in optimizing, but I play in a different ballpark than most. I write image processing code that usually starts with code that looks like this:

    for (int y=0; y < imageHeight; y++) {
        for (int x=0; x < imageWidth; x++) {
           // do something
        }
    }
    

    the "do something part" typically happens on the order of several million times (ie, between 3 and 30). By scraping cycles in that "do something" phase, the performance gains are hugely magnified. I don't usually start there - I usually start by writing the code to work first, then do my best to refactor the C to be naturally better (better algorithm, less load in the loop etc). I usually need to read assembly to see what's going on and rarely need to write it. I do this maybe every two or three months.

  3. doing something the language won't let me. These include - getting the processor architecture and specific processor features, accessing flags not in the CPU (man, I really wish C gave you access to the carry flag), etc. I do this maybe once a year or two years.


Only when using some special purpose instruction sets the compiler doesn't support.

To maximize the computing power of a modern CPU with multiple pipelines and predictive branching you need to structure the assembly program in a way that makes it a) almost impossible for a human to write b) even more impossible to maintain.

Also, better algorithms, data structures and memory management will give you at least an order of magnitude more performance than the micro-optimizations you can do in assembly.


Although C is "close" to the low-level manipulation of 8-bit, 16-bit, 32-bit, 64-bit data, there are a few mathematical operations not supported by C which can often be performed elegantly in certain assembly instruction sets:

  1. Fixed-point multiplication: The product of two 16-bit numbers is a 32-bit number. But the rules in C says that the product of two 16-bit numbers is a 16-bit number, and the product of two 32-bit numbers is a 32-bit number -- the bottom half in both cases. If you want the top half of a 16x16 multiply or a 32x32 multiply, you have to play games with the compiler. The general method is to cast to a larger-than-necessary bit width, multiply, shift down, and cast back:

    int16_t x, y;
    // int16_t is a typedef for "short"
    // set x and y to something
    int16_t prod = (int16_t)(((int32_t)x*y)>>16);`
    

    In this case the compiler may be smart enough to know that you're really just trying to get the top half of a 16x16 multiply and do the right thing with the machine's native 16x16multiply. Or it may be stupid and require a library call to do the 32x32 multiply that's way overkill because you only need 16 bits of the product -- but the C standard doesn't give you any way to express yourself.

  2. Certain bitshifting operations (rotation/carries):

    // 256-bit array shifted right in its entirety:
    uint8_t x[32];
    for (int i = 32; --i > 0; )
    {
       x[i] = (x[i] >> 1) | (x[i-1] << 7);
    }
    x[0] >>= 1;
    

    This is not too inelegant in C, but again, unless the compiler is smart enough to realize what you are doing, it's going to do a lot of "unnecessary" work. Many assembly instruction sets allow you to rotate or shift left/right with the result in the carry register, so you could accomplish the above in 34 instructions: load a pointer to the beginning of the array, clear the carry, and perform 32 8-bit right-shifts, using auto-increment on the pointer.

    For another example, there are linear feedback shift registers (LFSR) that are elegantly performed in assembly: Take a chunk of N bits (8, 16, 32, 64, 128, etc), shift the whole thing right by 1 (see above algorithm), then if the resulting carry is 1 then you XOR in a bit pattern that represents the polynomial.

Having said that, I wouldn't resort to these techniques unless I had serious performance constraints. As others have said, assembly is much harder to document/debug/test/maintain than C code: the performance gain comes with some serious costs.

edit: 3. Overflow detection is possible in assembly (can't really do it in C), this makes some algorithms much easier.