I'm having a hard time beating my compiler using inline assembly.
What's a good, non-contrived example of a function that the compiler has a hard time making really, really fast and simple, but that's relatively simple to write with inline assembly?
In computer programming, an inline assembler is a feature of some compilers that allows low-level code written in assembly language to be embedded within a program, among code that otherwise has been compiled from a higher-level language such as C or Ada.
It's even possible on most compilers to include a little bit of assembly code right inside your C or C++ file, called "inline assembly" because the assembly is inside the C/C++. This is usually a bit faster (because there is no function call overhead) and simpler (less hassle at build time) than having a separate assembly file.
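For example, with GCC-style inline assembly (the syntax differs between compilers; MSVC uses __asm { ... } blocks), a trivial snippet embedded in C might look like the sketch below. It's purely illustrative, since the compiler would generate the same instruction from plain C anyway:

#include <stdio.h>

/* Trivial illustration of GCC-style inline assembly on x86:
   add two ints with a single ADD instruction. */
static int add_asm(int a, int b)
{
    int result;
    __asm__ ("addl %1, %0"
             : "=r" (result)          /* output in any register           */
             : "r" (b), "0" (a));     /* b in a register, a starts in %0  */
    return result;
}

int main(void)
{
    printf("%d\n", add_asm(2, 3));    /* prints 5 */
    return 0;
}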
The reason C is faster than assembly is that the only way to write optimal code is to measure it on a real machine, and with C you can run many more experiments, much faster.
If you don't consider SIMD operations cheating, you can usually write SIMD assembly that performs much better than your compiler's autovectorization abilities (if it even has autovectorization!).
Here's a very basic SSE (one of x86's SIMD instruction sets) tutorial. It's for Visual C++ inline assembly.
Edit: Here's a small pair of functions if you want to try it yourself. They calculate an n-length dot product. One uses SSE2 instructions inline (GCC inline syntax); the other is very basic C.
It's very, very simple, and I'd be surprised if a good compiler couldn't vectorize the simple C loop, but if it doesn't you should see a speedup with SSE2. The SSE2 version could probably be faster if I used more registers, but I don't want to stretch my very weak SSE skills :).
float dot_asm(float *a, float *b, int n)
{
    float ans = 0;
    int i;

    // Note: no handling here for arrays whose length is not a multiple of 8.
    while (n > 0) {
        float tmp[4] __attribute__ ((aligned(16)));

        __asm__ __volatile__(
            "xorps %%xmm0, %%xmm0\n\t"   // zero the accumulator
            "movups (%0), %%xmm1\n\t"    // load 4 floats from a
            "movups 16(%0), %%xmm2\n\t"  // load the next 4 floats from a
            "movups (%1), %%xmm3\n\t"    // load 4 floats from b
            "movups 16(%1), %%xmm4\n\t"  // load the next 4 floats from b
            "add $32, %0\n\t"            // advance a by 8 floats
            "add $32, %1\n\t"            // advance b by 8 floats
            "mulps %%xmm3, %%xmm1\n\t"   // multiply a and b element-wise
            "mulps %%xmm4, %%xmm2\n\t"
            "addps %%xmm2, %%xmm1\n\t"
            "addps %%xmm1, %%xmm0"       // collect the 4 partial sums in xmm0
            : "+r" (a), "+r" (b)
            :
            : "xmm0", "xmm1", "xmm2", "xmm3", "xmm4",
              "memory");                 // "memory" because the asm reads *a and *b

        __asm__ __volatile__(
            "movaps %%xmm0, %0"          // spill the 4 partial sums to tmp
            : "=m" (tmp)
            :
            : "xmm0", "memory");

        // Reduce the 4 partial sums into the scalar result.
        for (i = 0; i < 4; i++) {
            ans += tmp[i];
        }
        n -= 8;
    }
    return ans;
}
float dot_c(float *a, float *b, int n)
{
    float ans = 0;
    int i;
    for (i = 0; i < n; i++) {
        ans += a[i] * b[i];
    }
    return ans;
}
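If you want to try them, a quick sanity-check harness might look like this (assuming GCC on an x86 target with SSE2; remember dot_asm only handles n that is a multiple of 8):

#include <stdio.h>

int main(void)
{
    /* Both versions should print 56.0 for this input. */
    float a[8], b[8];
    int i;
    for (i = 0; i < 8; i++) {
        a[i] = (float)i;   /* 0, 1, ..., 7 */
        b[i] = 2.0f;
    }
    printf("dot_asm: %f\n", dot_asm(a, b, 8));
    printf("dot_c:   %f\n", dot_c(a, b, 8));
    return 0;
}

Compile with something like gcc -O2 -msse2; the inline asm only builds with GCC-compatible compilers targeting x86.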
Since the question is related to the iPhone and assembly code, I'll give an example that's relevant in the iPhone world (and not some SSE or x86 asm). If anybody decides to write assembly code for a real-world app, it will most likely be some sort of digital signal processing or image manipulation. Examples: converting the colorspace of RGB pixels, encoding images to JPEG/PNG format, or encoding sound to MP3, AMR, or G.729 for VoIP applications.
In the case of sound encoding, many routines cannot be translated by the compiler into efficient asm code because they simply have no equivalent in C. Examples of commonly used operations in sound processing: saturated math, multiply-accumulate routines, matrix multiplication.
Example of a saturated add: a 32-bit signed int has the range 0x80000000 <= int32 <= 0x7fffffff. If you add two ints, the result can overflow, and in certain digital signal processing cases that is unacceptable. Basically, if the result overflows or underflows, a saturated add should return 0x80000000 or 0x7fffffff instead. Checking for that takes a full C function; an optimized version of saturated add could be:
int saturated_add(int a, int b)
{
    int result = a + b;
    // Overflow is only possible when both operands have the same sign.
    if (((a ^ b) & 0x80000000) == 0) {
        // If the result's sign differs from the operands', it overflowed.
        if ((result ^ a) & 0x80000000) {
            result = (a < 0) ? 0x80000000 : 0x7fffffff;
        }
    }
    return result;
}
You may also do multiple if/else checks for overflow, or on x86 you may check the overflow flag (which also requires you to use asm). The iPhone uses an ARMv6 or v7 CPU, which has DSP instructions. So the saturated_add function, with its multiple branches (if/else statements) and two 32-bit constants, can become one simple asm instruction that takes only one CPU cycle.
Simply making saturated_add use that asm instruction could make the entire algorithm two to three times faster (and smaller in size). Here's the QADD manual:
QADD
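As a rough sketch, the whole function collapses to that single instruction with GCC inline assembly, assuming an ARM core that has the DSP extensions (ARMv5TE/ARMv6 and later):

/* Saturated 32-bit add as a single QADD instruction (sketch; assumes
   GCC inline assembly and an ARM target with the DSP extensions). */
static inline int saturated_add_qadd(int a, int b)
{
    int result;
    __asm__ ("qadd %0, %1, %2"
             : "=r" (result)
             : "r" (a), "r" (b));
    return result;
}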
Other examples of code that is often executed in long loops are:
res1 = a + b1*c1;
res2 = a + b2*c2;
res3 = a + b3*c3;
It seems like nothing can be optimized here, but on an ARM CPU you can use specific DSP instructions that take fewer cycles than a plain multiplication! That's right: a + b*c with the right instructions can execute faster than a simple a*b. For cases like this the compiler simply cannot understand the logic of your code and can't use these DSP instructions directly, which is why you need to write asm manually to optimize the code. BUT you should only hand-write the parts that actually need to be optimized; if you start writing simple loops manually, you almost certainly won't beat the compiler! There are multiple good papers on the web on using inline assembly to code FIR filters, AMR encoding/decoding, etc. A small multiply-accumulate sketch follows.
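As a sketch, a + b*c can be expressed as a single ARM MLA (multiply-accumulate) instruction with GCC inline assembly; the DSP variants such as SMLABB follow the same pattern but multiply 16-bit halves. Note that compilers usually emit a plain MLA on their own; it's the DSP and saturating variants they tend to miss.

/* Sketch: a + b*c as one ARM MLA instruction, assuming GCC inline
   assembly on an ARMv6/v7 target such as the iPhone. */
static inline int mla_asm(int a, int b, int c)
{
    int result;
    __asm__ ("mla %0, %1, %2, %3"    /* result = b*c + a */
             : "=r" (result)
             : "r" (b), "r" (c), "r" (a));
    return result;
}

/* e.g. res1 = mla_asm(a, b1, c1); res2 = mla_asm(a, b2, c2); ... */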