
Is inline assembly language slower than native C++ code?


Yes, most of the time.

First of all, you start from the wrong assumption that a low-level language (assembly in this case) will always produce faster code than a high-level language (C++ and C in this case). It's not true. Is C code always faster than Java code? No, because there is another variable: the programmer. The way you write code and your knowledge of architecture details greatly influence performance (as you saw in this case).

You can always produce an example where handmade assembly code is better than compiled code, but usually it's a fictional example or a single routine, not a true program of 500,000+ lines of C++ code. I think compilers will produce better assembly code 95% of the time, and only sometimes, in rare cases, you may need to write assembly code for a few short, highly used, performance-critical routines, or when you have to access features your favorite high-level language does not expose. Do you want a taste of this complexity? Read this awesome answer here on SO.

Why is this?

First of all, because compilers can do optimizations that we can't even imagine (see this short list), and they will do them in seconds (when we may need days).

When you code in assembly, you have to write well-defined functions with a well-defined call interface. Compilers, however, can take into account whole-program optimization and inter-procedural optimization: register allocation, constant propagation, common subexpression elimination, instruction scheduling, and other complex, non-obvious optimizations (the polytope model, for example). On RISC architectures people stopped worrying about this many years ago (instruction scheduling, for example, is very hard to tune by hand), and modern CISC CPUs have very long pipelines too.

For some complex microcontrollers, even system libraries are written in C instead of assembly because their compilers produce better (and easier to maintain) final code.

Compilers can sometimes automatically use MMX/SIMDx instructions by themselves, and if you don't use them you simply can't compare (other answers have already reviewed your assembly code very well). Just for loops, this is a short list of the loop optimizations commonly checked for by a compiler (do you think you could do all of that yourself on a deadline?). If you write something in assembly, I think you have to consider at least some simple optimizations. The school-book example for arrays is to unroll the loop (its size is known at compile time). Do it and run your test again.

These days it's also really uncommon to need assembly language, for another reason: the plethora of different CPUs. Do you want to support them all? Each has a specific microarchitecture and some specific instruction sets. They have different numbers of functional units, and assembly instructions should be arranged to keep them all busy. If you write in C you may use PGO, but in assembly you will need deep knowledge of that specific architecture (and you will have to rethink and redo everything for another architecture). For small tasks the compiler usually does it better, and for complex tasks the work usually isn't repaid (and the compiler may do better anyway).

If you sit down and take a look at your code, you'll probably see that you'll gain more by redesigning your algorithm than by translating it to assembly (read this great post here on SO); there are high-level optimizations (and hints to the compiler) you can effectively apply before you need to resort to assembly language. It's probably worth mentioning that by using intrinsics you will often get the performance gain you're looking for, while the compiler will still be able to perform most of its optimizations.

All this said, even when you can produce assembly code that is 5-10 times faster, you should ask your customers whether they would prefer to pay for a week of your time or to buy a $50 faster CPU. Extreme optimization, more often than not (and especially in LOB applications), is simply not required of most of us.


Your assembly code is suboptimal and may be improved:

  • You are pushing and popping a register (EDX) in your inner loop. This should be moved out of the loop.
  • You reload the array pointers in every iteration of the loop. This should be moved out of the loop.
  • You use the loop instruction, which is known to be dead slow on most modern CPUs (possibly a result of using an ancient assembly book*)
  • You take no advantage of manual loop unrolling.
  • You don't use available SIMD instructions.

So unless you vastly improve your skill set regarding assembler, it doesn't make sense for you to write assembly code for performance.

*Of course, I don't know if you really got the loop instruction from an ancient assembly book. But you almost never see it in real-world code, as every compiler out there is smart enough not to emit loop; you only see it in (IMHO) bad and outdated books.


Even before delving into assembly, there are code transformations that exist at a higher level.

static int const TIMES = 100000;

void calcuC(int *x, int *y, int length) {
  for (int i = 0; i < TIMES; i++) {
    for (int j = 0; j < length; j++) {
      x[j] += y[j];
    }
  }
}

can be transformed via loop rotation into:

static int const TIMES = 100000;

void calcuC(int *x, int *y, int length) {
    for (int j = 0; j < length; ++j) {
      for (int i = 0; i < TIMES; ++i) {
        x[j] += y[j];
      }
    }
}

which is much better as far as memory locality goes.

This can be optimized further: doing a += b X times is equivalent to doing a += X * b, so we get:

static int const TIMES = 100000;

void calcuC(int *x, int *y, int length) {
    for (int j = 0; j < length; ++j) {
      x[j] += TIMES * y[j];
    }
}

however it seems my favorite optimizer (LLVM) does not perform this transformation.

[edit] I found that the transformation is performed if we add the restrict qualifier to x and y. Indeed, without this restriction, x[j] and y[j] could alias the same location, which makes the transformation erroneous. [end edit]

Anyway, this is, I think, the optimized C version, and it is already much simpler. Based on it, here is my crack at the ASM (I let Clang generate it; I am useless at writing it by hand):

calcuAsm:                               # @calcuAsm
.Ltmp0:
    .cfi_startproc
# BB#0:
    testl   %edx, %edx
    jle .LBB0_2
    .align  16, 0x90
.LBB0_1:                                # %.lr.ph
                                        # =>This Inner Loop Header: Depth=1
    imull   $100000, (%rsi), %eax   # imm = 0x186A0
    addl    %eax, (%rdi)
    addq    $4, %rsi
    addq    $4, %rdi
    decl    %edx
    jne .LBB0_1
.LBB0_2:                                # %._crit_edge
    ret
.Ltmp1:
    .size   calcuAsm, .Ltmp1-calcuAsm
.Ltmp2:
    .cfi_endproc

I am afraid I don't understand where all those instructions come from, but you can always have fun and try to see how it compares... I'd still use the optimized C version rather than the assembly one in real code, though: it's much more portable.


Short answer: yes.

Long answer: yes, unless you really know what you're doing, and have a reason to do so.


I have fixed my asm code:

  __asm
{
    mov ebx,TIMES             ; outer repeat counter
 start:
    mov ecx,lengthOfArray
    mov esi,x
    shr ecx,1                 ; two 32-bit ints per MMX register
    mov edi,y
label:
    movq mm0,QWORD PTR[esi]   ; load two ints from x
    paddd mm0,QWORD PTR[edi]  ; packed add of two ints from y
    add edi,8
    movq QWORD PTR[esi],mm0   ; store result back to x
    add esi,8
    dec ecx
    jnz label
    dec ebx
    jnz start
    emms                      ; exit MMX state before any FP code runs
};

Results for Release version:

 Function of assembly version: 81
 Function of C++ version: 161

The assembly code in release mode is almost twice as fast as the C++ version.


Does that mean I should not trust the performance of assembly language written by my hands

Yes, that is exactly what it means, and it is true for every language. If you don't know how to write efficient code in language X, then you should not trust your ability to write efficient code in X. And so, if you want efficient code, you should use another language.

Assembly is particularly sensitive to this because, well, what you see is what you get. You write the specific instructions that you want the CPU to execute. With high-level languages, there is a compiler in between, which can transform your code and remove many inefficiencies. With assembly, you're on your own.


The only reason to use assembly language nowadays is to use some features not accessible by the language.

This applies to:

  • Kernel programming that needs access to certain hardware features such as the MMU
  • High performance programming that uses very specific vector or multimedia instructions not supported by your compiler.

But current compilers are quite smart: they can even replace two separate statements like d = a / b; r = a % b; with a single instruction that calculates the division and remainder in one go, if it is available, even though C does not have such an operator.