 

Does the compiler use SSE instructions for regular C code?

I see people using the -msse -msse2 -mfpmath=sse flags by default, hoping that this will improve performance. I know that SSE gets engaged when special vector types are used in C code. But do these flags make any difference for regular C code? Does the compiler use SSE to optimize regular C code?

Jennifer M. asked Jun 10 '18




1 Answer

Yes, modern compilers auto-vectorize with SSE2 if you compile with full optimization. clang enables it even at -O2, gcc at -O3.

Even at -O1 or -Os, compilers will use SIMD load/store instructions to copy or initialize structs or other objects wider than an integer register. That doesn't really count as auto-vectorization; it's more like part of their default builtin memset / memcpy strategy for small fixed-size blocks. But it does take advantage of and require SIMD instructions to be supported.
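
As a quick illustration of that (hypothetical names; the exact instructions vary by compiler version), a 16-byte struct copy typically compiles to a single SIMD load/store pair even at -O1:

struct pair64 { long a, b; };   // 16 bytes on x86-64

void copy_pair(struct pair64 *dst, const struct pair64 *src) {
    *dst = *src;   // gcc/clang typically emit one movdqu load
                   // plus one movups store, not two 8-byte copies
}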


SSE2 is baseline / non-optional for x86-64, so compilers can always use SSE1/SSE2 instructions when targeting x86-64. Later instruction sets (SSE4, AVX, AVX2, AVX512, and non-SIMD extensions like BMI2, popcnt, etc.) have to be enabled manually to tell the compiler it's ok to make code that won't run on older CPUs. Or to get it to generate multiple versions of code and choose at runtime, but that has extra overhead and is only worth it for larger functions.
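
One low-effort way to get that runtime dispatch from GCC is the target_clones attribute (a sketch; the function name is just an example, and support depends on compiler version and target):

__attribute__((target_clones("avx2", "default")))
void add_arrays(float *dst, const float *a, const float *b, int n) {
    for (int i = 0; i < n; i++)
        dst[i] = a[i] + b[i];   // GCC emits an AVX2 clone and a baseline
                                // SSE2 clone, plus a resolver that picks
                                // one at load time (ifunc mechanism)
}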

-msse -msse2 -mfpmath=sse is already the default for x86-64, but not for 32-bit i386. Some 32-bit calling conventions return FP values in x87 registers, so it can be inconvenient to use SSE/SSE2 for computation and then have to store/reload the result to get it in x87 st(0). With -mfpmath=sse, smarter compilers might still use x87 for a calculation that produces an FP return value.
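
You can see the store/reload in a function that returns a float (the asm in the comments is roughly what gcc -m32 -msse2 -mfpmath=sse -O2 produces; exact code-gen varies):

float add1(float x) {
    return x + 1.0f;
    //   movss  xmm0, DWORD PTR [esp+4]   ; load the arg
    //   addss  xmm0, DWORD PTR .LC0      ; compute in SSE...
    //   movss  DWORD PTR [esp+4], xmm0   ; ...but spill to memory
    //   fld    DWORD PTR [esp+4]         ; reload into st(0) for the
    //   ret                              ; 32-bit return convention
}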

On 32-bit x86, -msse2 might not be on by default; it depends on how your compiler was configured. If you're using 32-bit because you're targeting CPUs so old they can't run 64-bit code, you might want to make sure it's disabled, or use only -msse.

The best way to make a binary tuned for the CPU you're compiling on is -O3 -march=native -mfpmath=sse, combined with link-time optimization and profile-guided optimization (gcc -fprofile-generate, run on some test data, then rebuild with gcc -fprofile-use).
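
The whole recipe looks roughly like this (prog.c and the test input are placeholders):

gcc -O3 -march=native -flto -fprofile-generate prog.c -o prog
./prog < some_test_input          # writes .gcda profile data
gcc -O3 -march=native -flto -fprofile-use prog.c -o prog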

Using -march=native makes binaries that might not run on earlier CPUs, if the compiler does choose to use new instructions. Profile-guided optimization is very helpful for gcc: it never unrolls loops without it. But with PGO, it knows which loops run often / for a lot of iterations, i.e. which loops are "hot" and worth spending more code-size on. Link-time optimization allows inlining / constant-propagation across files. It's very helpful if you have C++ with a lot of small functions that you don't actually define in header files.


See How to remove "noise" from GCC/clang assembly output? for more about looking at compiler output and making sense of it.

Here are some specific examples on the Godbolt compiler explorer for x86-64. Godbolt also has gcc for several other architectures, and with clang you can add -target mips or whatever, so you can also see auto-vectorization for ARM NEON with the right compiler options to enable it. You can use -m32 with the x86-64 compilers to get 32-bit code-gen.

int sumint(int *arr) {
    int sum = 0;
    for (int i=0 ; i<2048 ; i++){
        sum += arr[i];
    }
    return sum;
}

inner loop with gcc8.1 -O3 (without -march=haswell or anything to enable AVX/AVX2):

.L2:                                 # do {
    movdqu  xmm2, XMMWORD PTR [rdi]    # load 16 bytes
    add     rdi, 16
    paddd   xmm0, xmm2                 # packed add of 4 x 32-bit integers
    cmp     rax, rdi
    jne     .L2                      # } while(p != endp)

    # then horizontal add and extract a single 32-bit sum

Without -ffast-math, compilers can't reorder FP operations, so the float equivalent of the sum above doesn't auto-vectorize (see the Godbolt link: you get scalar addss). (OpenMP can enable it on a per-loop basis, or you can use -ffast-math.)
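
A sketch of the per-loop OpenMP route (the pragma takes effect with -fopenmp or -fopenmp-simd; the function name is just for illustration):

float sumfloat(const float *arr) {
    float sum = 0.0f;
    #pragma omp simd reduction(+:sum)   // explicitly allow reassociating
    for (int i = 0; i < 2048; i++)      // this one reduction, so the
        sum += arr[i];                  // compiler can vectorize with addps
    return sum;
}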

But some FP stuff can safely auto-vectorize without changing order of operations.

// clang won't contract this into an FMA without -ffast-math :/
// but gcc will (if you compile with -march=haswell)
void scale_array(float *arr) {
    for (int i=0 ; i<2048 ; i++){
        arr[i] = arr[i] * 2.1f + 1.234f;
    }
}

  # load constants: xmm2 = {2.1,   2.1,   2.1,   2.1}
  #                 xmm1 = {1.234, 1.234, 1.234, 1.234}
.L9:   # gcc8.1 -O3                       # do {
    movups  xmm0, XMMWORD PTR [rdi]         # load unaligned packed floats
    add     rdi, 16
    mulps   xmm0, xmm2                      # multiply Packed Single-precision
    addps   xmm0, xmm1                      # add Packed Single-precision
    movups  XMMWORD PTR [rdi-16], xmm0      # store back to the array
    cmp     rax, rdi
    jne     .L9                           # }while(p != endp)

multiplier = 2.0f results in the compiler using addps to double the value instead of mulps, cutting throughput by a factor of 2 on Haswell / Broadwell! That's because before SKL, FP add only runs on one execution port, but there are two FMA units that can run multiplies. SKL dropped the dedicated adder and runs FP add with the same 2-per-clock throughput and latency as mul and FMA. (http://agner.org/optimize/, and see other performance links in the x86 tag wiki.)

Compiling with -march=haswell lets the compiler use a single FMA for the scale + add. (But clang won't contract the expression into an FMA unless you use -ffast-math; -ffp-contract=fast enables FP contraction on its own, without the other aggressive -ffast-math optimizations.)
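
If you want the fused multiply-add regardless of contraction settings, C99's fmaf from <math.h> requests it explicitly (a sketch; this is only fast if the target actually has FMA hardware, e.g. -march=haswell, otherwise it compiles to a slow libm call):

#include <math.h>

void scale_array_fma(float *arr) {
    for (int i = 0; i < 2048; i++)
        arr[i] = fmaf(arr[i], 2.1f, 1.234f);  // single rounding; becomes a
                                              // vfmadd instruction when the
                                              // target supports FMA
}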

Peter Cordes answered Oct 06 '22