 

Does the compiler use SSE instructions for regular C code?

I see people using the -msse -msse2 -mfpmath=sse flags by default, hoping that this will improve performance. I know that SSE gets engaged when special vector types are used in C code. But do these flags make any difference for regular C code? Does the compiler use SSE to optimize regular C code?

Jennifer M. asked Jun 10 '18




1 Answer

Yes, modern compilers auto-vectorize with SSE2 if you compile with full optimization. clang enables it even at -O2, gcc at -O3.

Even at -O1 or -Os, compilers will use SIMD load/store instructions to copy or initialize structs or other objects wider than an integer register. That doesn't really count as auto-vectorization; it's more like part of their default builtin memset / memcpy strategy for small fixed-size blocks. But it does take advantage of and require SIMD instructions to be supported.
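
As a quick illustration of that (hypothetical names; the exact instructions vary by compiler version), a 16-byte struct copy typically compiles to a single SIMD load/store pair even at -O1:

struct pair64 { long a, b; };   // 16 bytes on x86-64

void copy_pair(struct pair64 *dst, const struct pair64 *src) {
    *dst = *src;   // gcc/clang typically emit one movdqu load
                   // plus one movups store, not two 8-byte copies
}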


SSE2 is baseline / non-optional for x86-64, so compilers can always use SSE1/SSE2 instructions when targeting x86-64. Later instruction sets (SSE4, AVX, AVX2, AVX512, and non-SIMD extensions like BMI2, popcnt, etc.) have to be enabled manually to tell the compiler it's ok to make code that won't run on older CPUs. Or to get it to generate multiple versions of code and choose at runtime, but that has extra overhead and is only worth it for larger functions.
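
One low-effort way to get that runtime dispatch from GCC is the target_clones attribute (a sketch; the function name is just an example, and support depends on compiler version and target):

__attribute__((target_clones("avx2", "default")))
void add_arrays(float *dst, const float *a, const float *b, int n) {
    for (int i = 0; i < n; i++)
        dst[i] = a[i] + b[i];   // GCC emits an AVX2 clone and a baseline
                                // SSE2 clone, plus a resolver that picks
                                // one at load time (ifunc mechanism)
}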

-msse -msse2 -mfpmath=sse is already the default for x86-64, but not for 32-bit i386. Some 32-bit calling conventions return FP values in x87 registers, so it can be inconvenient to use SSE/SSE2 for computation and then have to store/reload the result to get it in x87 st(0). With -mfpmath=sse, smarter compilers might still use x87 for a calculation that produces an FP return value.
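
You can see the store/reload in a function that returns a float (the asm in the comments is roughly what gcc -m32 -msse2 -mfpmath=sse -O2 produces; exact code-gen varies):

float add1(float x) {
    return x + 1.0f;
    //   movss  xmm0, DWORD PTR [esp+4]   ; load the arg
    //   addss  xmm0, DWORD PTR .LC0      ; compute in SSE...
    //   movss  DWORD PTR [esp+4], xmm0   ; ...but spill to memory
    //   fld    DWORD PTR [esp+4]         ; reload into st(0) for the
    //   ret                              ; 32-bit return convention
}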

On 32-bit x86, -msse2 might not be on by default; it depends on how your compiler was configured. If you're using 32-bit because you're targeting CPUs so old they can't run 64-bit code, you might want to make sure it's disabled, or use only -msse.

The best way to make a binary tuned for the CPU you're compiling on is -O3 -march=native -mfpmath=sse, combined with link-time optimization and profile-guided optimization (gcc -fprofile-generate, run on some test data, then rebuild with gcc -fprofile-use).
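
The whole recipe looks roughly like this (prog.c and the test input are placeholders):

gcc -O3 -march=native -flto -fprofile-generate prog.c -o prog
./prog < some_test_input          # writes .gcda profile data
gcc -O3 -march=native -flto -fprofile-use prog.c -o prog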

Using -march=native makes binaries that might not run on earlier CPUs, if the compiler does choose to use new instructions. Profile-guided optimization is very helpful for gcc: it never unrolls loops without it. But with PGO, it knows which loops run often / for a lot of iterations, i.e. which loops are "hot" and worth spending more code-size on. Link-time optimization allows inlining / constant-propagation across files. It's very helpful if you have C++ with a lot of small functions that you don't actually define in header files.


See How to remove "noise" from GCC/clang assembly output? for more about looking at compiler output and making sense of it.

Here are some specific examples on the Godbolt compiler explorer for x86-64. Godbolt also has gcc for several other architectures, and with clang you can add -target mips or whatever, so you can also see auto-vectorization for ARM NEON with the right compiler options to enable it. You can use -m32 with the x86-64 compilers to get 32-bit code-gen.

int sumint(int *arr) {
    int sum = 0;
    for (int i=0 ; i<2048 ; i++){
        sum += arr[i];
    }
    return sum;
}

inner loop with gcc8.1 -O3 (without -march=haswell or anything to enable AVX/AVX2):

.L2:                                 # do {
    movdqu  xmm2, XMMWORD PTR [rdi]    # load 16 bytes
    add     rdi, 16
    paddd   xmm0, xmm2                 # packed add of 4 x 32-bit integers
    cmp     rax, rdi
    jne     .L2                      # } while(p != endp)

    # then horizontal add and extract a single 32-bit sum

Without -ffast-math, compilers can't reorder FP operations, so the float equivalent of the sum above doesn't auto-vectorize (see the Godbolt link: you get scalar addss). (OpenMP can enable it on a per-loop basis, or you can use -ffast-math.)
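
A sketch of the per-loop OpenMP route (the pragma takes effect with -fopenmp or -fopenmp-simd; the function name is just for illustration):

float sumfloat(const float *arr) {
    float sum = 0.0f;
    #pragma omp simd reduction(+:sum)   // explicitly allow reassociating
    for (int i = 0; i < 2048; i++)      // this one reduction, so the
        sum += arr[i];                  // compiler can vectorize with addps
    return sum;
}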

But some FP stuff can safely auto-vectorize without changing order of operations.

// clang won't contract this into an FMA without -ffast-math :/
// but gcc will (if you compile with -march=haswell)
void scale_array(float *arr) {
    for (int i=0 ; i<2048 ; i++){
        arr[i] = arr[i] * 2.1f + 1.234f;
    }
}

  # load constants: xmm2 = {2.1,   2.1,   2.1,   2.1}
  #                 xmm1 = {1.234, 1.234, 1.234, 1.234}
.L9:   # gcc8.1 -O3                       # do {
    movups  xmm0, XMMWORD PTR [rdi]         # load unaligned packed floats
    add     rdi, 16
    mulps   xmm0, xmm2                      # multiply Packed Single-precision
    addps   xmm0, xmm1                      # add Packed Single-precision
    movups  XMMWORD PTR [rdi-16], xmm0      # store back to the array
    cmp     rax, rdi
    jne     .L9                           # }while(p != endp)

multiplier = 2.0f results in the compiler using addps to double the value instead of mulps, cutting throughput by a factor of 2 on Haswell / Broadwell! That's because before SKL, FP add only runs on one execution port, but there are two FMA units that can run multiplies. SKL dropped the dedicated adder and runs FP add with the same 2-per-clock throughput and latency as mul and FMA. (http://agner.org/optimize/, and see other performance links in the x86 tag wiki.)

Compiling with -march=haswell lets the compiler use a single FMA for the scale + add. (But clang won't contract the expression into an FMA unless you use -ffast-math; -ffp-contract=fast enables FP contraction on its own, without the other aggressive -ffast-math optimizations.)
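
If you want the fused multiply-add regardless of contraction settings, C99's fmaf from <math.h> requests it explicitly (a sketch; this is only fast if the target actually has FMA hardware, e.g. -march=haswell, otherwise it compiles to a slow libm call):

#include <math.h>

void scale_array_fma(float *arr) {
    for (int i = 0; i < 2048; i++)
        arr[i] = fmaf(arr[i], 2.1f, 1.234f);  // single rounding; becomes a
                                              // vfmadd instruction when the
                                              // target supports FMA
}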

Peter Cordes answered Oct 06 '22