
How to write C++ code that the compiler can efficiently compile to SSE or AVX?

Let's say I have a function written in C++ that performs matrix-vector multiplications on a lot of vectors. It takes a pointer to the array of vectors to transform. Am I correct to assume that the compiler cannot efficiently optimize that into SIMD instructions because it does not know the alignment of the passed pointer at compile time (16-byte alignment for SSE, 32-byte alignment for AVX)? Or is the memory alignment of the data irrelevant for optimal SIMD code, so that it only affects cache performance?

If alignment is important for the generated code, how can I let the (Visual C++) compiler know that I intend to only pass values with a certain alignment to the function?

matthias_buehlmann asked Nov 03 '15 16:11


1 Answer

In theory alignment should not matter on Intel processors since Nehalem. Therefore, your compiler should be able to produce code in which a pointer being aligned or not is not an issue.

Unaligned load/store instructions perform the same as their aligned counterparts on Intel processors since Nehalem. However, until AVX arrived with Sandy Bridge, unaligned loads could not be folded into another operation as a memory operand for micro-op fusion.

Additionally, even before AVX, 16-byte-aligned memory could still help by avoiding the penalty of cache-line splits, so it was still reasonable for a compiler to add code that runs until the pointer is 16-byte aligned.

Since AVX, there is no advantage to using aligned load/store instructions anymore, and as far as the choice of instruction goes there is no reason for a compiler to add code to make a pointer 16-byte or 32-byte aligned.

However, there is still a reason to use aligned memory with AVX: avoiding cache-line splits. Therefore, it would be reasonable for a compiler to add code to make the pointer 32-byte aligned even if it still used an unaligned load instruction.
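To make "add code until the pointer is aligned" concrete, here is a minimal hand-written sketch of that peeling pattern in plain C++. It is not what any particular compiler emits; the names peel_count and scale_peeled are made up for illustration, and the sketch assumes that only the destination pointer is brought to a 16-byte boundary (in general both pointers cannot be aligned at once).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: number of leading floats to process one at a time
// before `p` reaches a 16-byte boundary (0 if it is already aligned).
inline std::size_t peel_count(const float* p, std::size_t n) {
    const std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(p);
    assert(addr % alignof(float) == 0);         // any valid float* satisfies this
    const std::size_t misalign = addr % 16;     // bytes past the last 16-byte boundary
    const std::size_t peel = misalign ? (16 - misalign) / sizeof(float) : 0;
    return peel < n ? peel : n;                 // never peel more than n elements
}

// Hypothetical peeled version of b[i] = s * a[i].
void scale_peeled(const float* a, float* b, std::size_t n, float s) {
    std::size_t i = 0;
    // Prologue: scalar iterations until the destination is 16-byte aligned.
    for (const std::size_t peel = peel_count(b, n); i < peel; ++i)
        b[i] = s * a[i];
    // Main body: blocks of four floats, each block of b starting on a
    // 16-byte boundary, so a vectorizer could use aligned stores here.
    for (; i + 4 <= n; i += 4)
        for (std::size_t j = 0; j < 4; ++j)
            b[i + j] = s * a[i + j];
    // Epilogue: the remaining 0 to 3 elements.
    for (; i < n; ++i)
        b[i] = s * a[i];
}
```

The prologue and epilogue are the kind of extra code that can disappear when the compiler is allowed to assume alignment and a known trip count.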

So in practice some compilers produce much simpler code when they are told to assume that a pointer is aligned.

I'm not aware of a method to tell MSVC that a pointer is aligned. With GCC and Clang (since 3.6) you can use the built-in __builtin_assume_aligned. With ICC and also GCC you can use #pragma omp simd aligned. With ICC you can also use __assume_aligned.
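As a rough sketch of how the GCC/Clang built-in might be wrapped so the same source still compiles elsewhere: ASSUME_ALIGNED and scale_aligned are hypothetical names invented for this example, and on compilers without the built-in the macro simply passes the pointer through with no hint.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// ASSUME_ALIGNED is a made-up helper macro, not a standard or vendor name.
#if defined(__GNUC__) || defined(__clang__)
#  define ASSUME_ALIGNED(T, p, n) static_cast<T*>(__builtin_assume_aligned((p), (n)))
#else
#  define ASSUME_ALIGNED(T, p, n) (p)   // no hint available: plain pointer
#endif

// Hypothetical function: b[i] = s * a[i], promising 16-byte alignment.
void scale_aligned(const float* __restrict a_in, float* __restrict b_in,
                   std::size_t n, float s) {
    // Debug-build check that the caller really keeps the promise; passing a
    // misaligned pointer to the built-in is undefined behavior.
    assert(reinterpret_cast<std::uintptr_t>(a_in) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(b_in) % 16 == 0);

    const float* a = ASSUME_ALIGNED(const float, a_in, 16);
    float*       b = ASSUME_ALIGNED(float, b_in, 16);
    for (std::size_t i = 0; i < n; ++i)
        b[i] = s * a[i];
}
```

The caller has to provide storage that actually meets the promise, for example arrays declared with alignas(16). C++20 also adds std::assume_aligned in <memory>; whether a given compiler version acts on that hint is something to check in its generated code.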

For example with GCC compiling this simple loop

void foo(float * __restrict a, float * __restrict b, int n)
{
    //a = (float*)__builtin_assume_aligned (a, 16);
    //b = (float*)__builtin_assume_aligned (b, 16);
    for(int i=0; i<(n & (-4)); i++) {
        b[i] = 3.14159f*a[i];
    }
}

with gcc -O3 -march=nehalem -S test.c and then running wc test.s gives 160 lines, whereas if I use __builtin_assume_aligned, wc test.s gives only 45 lines. With Clang, both cases give 110 lines.

So with Clang, informing the compiler that the arrays were aligned made no difference (in this case), but with GCC it did. Counting lines of code is not a sufficient metric to gauge performance, but I'm not going to post all the assembly here; I just want to illustrate that your compiler may produce very different code when it is told the arrays are aligned.

Of course, the additional overhead that GCC has for not assuming the arrays are aligned may make no difference in practice. You have to test and see.


In any case, if you want to get the most from SIMD, I would not rely on the compiler to do it correctly (especially with MSVC). Your example of matrix*vector is a poor one in general (but maybe not for some special cases) since it's memory-bandwidth bound. But if you choose matrix*matrix, no compiler is going to optimize that well without a lot of help that does not conform to the C++ standard. In these cases you will need intrinsics/built-ins/assembly, in which you have explicit control of the alignment anyway.
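As an illustration of what "explicit control" looks like with intrinsics, here is a minimal sketch of the same scaling loop written with SSE intrinsics from <xmmintrin.h>. The function name scale_sse and the HAVE_SSE macro are invented for this example, and the scalar fallback is only there so the sketch also builds on targets without SSE.

```cpp
#include <cassert>
#include <cstddef>

// HAVE_SSE is a made-up macro for this sketch.
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#  include <xmmintrin.h>
#  define HAVE_SSE 1
#else
#  define HAVE_SSE 0
#endif

// Hypothetical function: b[i] = s * a[i], with no alignment requirement.
void scale_sse(const float* a, float* b, std::size_t n, float s) {
    std::size_t i = 0;
#if HAVE_SSE
    const __m128 vs = _mm_set1_ps(s);               // broadcast s to all four lanes
    for (; i + 4 <= n; i += 4) {
        const __m128 va = _mm_loadu_ps(a + i);      // unaligned load: any address is fine
        _mm_storeu_ps(b + i, _mm_mul_ps(vs, va));   // unaligned store
    }
#endif
    // Scalar tail (or the whole loop when SSE is not available).
    for (; i < n; ++i)
        b[i] = s * a[i];
}
```

Here the choice is made per load and store: _mm_loadu_ps and _mm_storeu_ps accept any address, while swapping in _mm_load_ps and _mm_store_ps would require 16-byte-aligned addresses and can fault otherwise.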


Edit:

The assembly from GCC contains a lot of extraneous lines which are not part of the text segment. Compiling with gcc -O3 -march=nehalem -c test.c, then running objdump -d on the object file and counting the lines in the text (code) segment, gives 108 lines without __builtin_assume_aligned and only 16 lines with it. This shows more clearly that GCC produces very different code when it assumes the arrays are aligned.


Edit:

I went ahead and tested the foo function above in MSVC 2013. It produces unaligned loads, and the code is much shorter than GCC's (I only show the main loop here):

$LL3@foo:
    movsxd  rax, r9d
    vmulps  xmm1, xmm0, XMMWORD PTR [r10+rax*4]
    vmovups XMMWORD PTR [r11+rax*4], xmm1
    lea     eax, DWORD PTR [r9+4]
    add     r9d, 8
    movsxd  rcx, eax
    vmulps  xmm1, xmm0, XMMWORD PTR [r10+rcx*4]
    vmovups XMMWORD PTR [r11+rcx*4], xmm1
    cmp     r9d, edx
    jl      SHORT $LL3@foo

This should be fine on processors since Nehalem (late 2008). But MSVC still has cleanup code for arrays whose length is not a multiple of four, even though I told the compiler that the trip count is a multiple of four (n & (-4)). At least GCC gets that right.
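For readers unfamiliar with the n & (-4) idiom used in the loop bound, this small sketch shows what it computes. The helper names round_down_4 and vector_trip_count are made up for illustration, and the bit trick relies on two's-complement integers (guaranteed since C++20 and the norm on x86 before that).

```cpp
#include <cassert>

// -4 has every bit set except the lowest two, so the AND clears those two
// bits and rounds a non-negative n down to a multiple of 4.
constexpr int round_down_4(int n) { return n & -4; }

static_assert(round_down_4(10) == 8, "10 rounds down to 8");
static_assert(round_down_4(8) == 8, "multiples of 4 are unchanged");
static_assert(round_down_4(3) == 0, "fewer than 4 elements leaves nothing");

// Number of full 4-float vector iterations for an array of n elements.
inline int vector_trip_count(int n) {
    assert(n >= 0);
    return round_down_4(n) / 4;
}
```

With the bound written this way the loop never has a partial vector left over, which is why no cleanup code should be needed. The same idea with -8 gives a multiple of eight for 8-float AVX vectors.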


Since AVX can fold unaligned loads, I checked GCC with AVX to see if the code was the same.

void foo(float * __restrict a, float * __restrict b, int n)
{
    //a = (float*)__builtin_assume_aligned (a, 32);
    //b = (float*)__builtin_assume_aligned (b, 32);
    for(int i=0; i<(n & (-8)); i++) {
        b[i] = 3.14159f*a[i];
    }
}

Without __builtin_assume_aligned GCC produces 168 lines of assembly, and with it only 17 lines.

6 revs answered Oct 12 '22 19:10