Consider the following toy example, where A is an n x 2 matrix stored in column-major order and I want to compute its column sums. sum_0 only computes the sum of the 1st column, while sum_1 does the 2nd column as well. This is really an artificial example, as there is essentially no need to define two functions for this task (I could write a single function with a double loop nest where the outer loop iterates from 0 to j). It is constructed to demonstrate the template problem I have in reality.
/* "test.c" */
#include <stdlib.h>
// j can be 0 or 1
static inline void sum_template (size_t j, size_t n, double *A, double *c) {
if (n == 0) return;
size_t i;
double *a = A, *b = A + n;
double c0 = 0.0, c1 = 0.0;
#pragma omp simd reduction (+: c0, c1) aligned (a, b: 32)
for (i = 0; i < n; i++) {
c0 += a[i];
if (j > 0) c1 += b[i];
}
c[0] = c0;
if (j > 0) c[1] = c1;
}
#define macro_define_sum(FUN, j) \
void FUN (size_t n, double *A, double *c) { \
sum_template(j, n, A, c); \
}
macro_define_sum(sum_0, 0)
macro_define_sum(sum_1, 1)
If I compile it with

gcc -O2 -mavx test.c

GCC (say the latest 8.2) would, after inlining, constant propagation and dead code elimination, optimize out the code involving c1 for the function sum_0 (check it on Godbolt).
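To make concrete what that optimization buys, here is a sketch of what sum_0 effectively reduces to once j = 0 is propagated (my reconstruction of the optimizer's view, with a hypothetical name; not GCC's literal output):

/* Sketch: sum_0 after inlining, propagation of j = 0, and dead code
 * elimination -- the c1 accumulator and the second column vanish. */
#include <stdlib.h>  /* size_t */

void sum_0_effective (size_t n, double *A, double *c) {
    if (n == 0) return;
    double *a = A;
    double c0 = 0.0;
    #pragma omp simd reduction (+: c0) aligned (a: 32)
    for (size_t i = 0; i < n; i++)
        c0 += a[i];
    c[0] = c0;
}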
I like this trick. By writing a single template function and passing in different configuration parameters, an optimizing compiler can generate different versions. It is much cleaner than copying-and-pasting a big proportion of the code and manually defining different function versions.
However, such convenience is lost if I activate OpenMP 4.0+ with

gcc -O2 -mavx -fopenmp test.c

sum_template is no longer inlined and no dead code elimination is applied (check it on Godbolt). But if I remove the flag -mavx to work with 128-bit SIMD, compiler optimization works as I expect (check it on Godbolt). So is this a bug? I am on an x86-64 (Sandy Bridge) machine.
Remark

Using GCC's auto-vectorization with -ftree-vectorize -ffast-math does not have this issue (check it on Godbolt). But I wish to use OpenMP because it offers a portable alignment pragma across different compilers.
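That is, for this example the invocation would be something like the following (flag combination assembled from the remark above):

gcc -O2 -mavx -ftree-vectorize -ffast-math test.c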
Background

I write modules for an R package, which needs to be portable across platforms and compilers. Writing an R extension requires no Makefile. When R is built on a platform, it knows what the default compiler is on that platform, and it configures a set of default compilation flags. R does not have an auto-vectorization flag, but it does have an OpenMP flag. This means that using OpenMP SIMD is the ideal way to utilize SIMD in an R package. See 1 and 2 for a bit more elaboration.
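For reference, when an R package opts into OpenMP, the standard mechanism from "Writing R Extensions" is a one-line src/Makevars rather than a full Makefile; a minimal sketch:

# src/Makevars -- R substitutes the platform compiler's own OpenMP
# flag (e.g. -fopenmp for GCC) for SHLIB_OPENMP_CFLAGS
PKG_CFLAGS = $(SHLIB_OPENMP_CFLAGS)
PKG_LIBS = $(SHLIB_OPENMP_CFLAGS)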
The simplest way to solve this problem is with __attribute__((always_inline)), or other compiler-specific overrides.

#ifdef __GNUC__
#define ALWAYS_INLINE __attribute__((always_inline)) inline
#elif defined(_MSC_VER)
#define ALWAYS_INLINE __forceinline inline
#else
#define ALWAYS_INLINE inline  // cross your fingers
#endif

ALWAYS_INLINE
static void sum_template (size_t j, size_t n, double *A, double *c) {
    ...
}
Godbolt proof that it works.
Also, don't forget to use -mtune=haswell, not just -mavx. It's usually a good idea. (However, promising aligned data will stop GCC's default -mavx256-split-unaligned-load tuning from splitting 256-bit loads into 128-bit vmovupd + vinsertf128, so code gen for this function is fine with -mtune=haswell. But normally you want this for gcc to auto-vectorize any other functions well.)
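Putting the flags discussed so far together, the invocation would be something like (a suggestion, not from the original question):

gcc -O2 -mavx -mtune=haswell -fopenmp test.c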
You don't really need static along with inline; if a compiler decides not to inline it, it can at least share the same definition across compilation units.
Normally gcc decides whether to inline according to function-size heuristics. But even setting -finline-limit=90000 doesn't get gcc to inline with your #pragma omp (How do I force gcc to inline a function?). I had been guessing that gcc didn't realize that constant propagation after inlining would simplify the conditional, but 90000 "pseudo-instructions" seems plenty big. There could be other heuristics.
Possibly OpenMP sets some per-function stuff differently in ways that could break the optimizer if it let them inline into other functions. Using __attribute__((target("avx"))) stops that function from inlining into functions compiled without AVX, so you can do runtime dispatching safely, without inlining "infecting" other functions with AVX instructions across if(avx) conditions.
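A sketch of that runtime-dispatch pattern (the function names are mine, for illustration; __builtin_cpu_supports is a GCC/clang builtin):

#include <stdlib.h>

/* target("avx") keeps the AVX body from inlining into sum_dispatch,
 * so no AVX instruction can leak across the CPU-feature check. */
__attribute__((target("avx")))
static void sum_avx (size_t n, const double *a, double *c) {
    double c0 = 0.0;
    for (size_t i = 0; i < n; i++) c0 += a[i];  /* compiled with AVX */
    c[0] = c0;
}

static void sum_baseline (size_t n, const double *a, double *c) {
    double c0 = 0.0;
    for (size_t i = 0; i < n; i++) c0 += a[i];  /* baseline SSE2 code */
    c[0] = c0;
}

void sum_dispatch (size_t n, const double *a, double *c) {
    if (__builtin_cpu_supports("avx"))
        sum_avx(n, a, c);
    else
        sum_baseline(n, a, c);
}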
One thing OpenMP does that you don't get with regular auto-vectorization is that reductions can be vectorized without enabling -ffast-math: vectorizing a reduction reorders the FP additions, which strict IEEE semantics otherwise forbid, and the reduction clause grants that permission explicitly.
Unfortunately OpenMP still doesn't bother to unroll with multiple accumulators or anything to hide FP latency. #pragma omp is a pretty good hint that a loop is actually hot and worth spending code-size on, so gcc should really do that, even without -fprofile-use.
So especially if this ever runs on data that's hot in L2 or L1 cache (or maybe L3), you should do something to get better throughput.
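A hedged sketch of the multiple-accumulator idea (the function name and the unroll factor of 4 are my choices for illustration, not from the original code):

#include <stdlib.h>

/* Four independent accumulators hide FP-add latency: the loop can
 * approach one (vector) add per cycle instead of one per add-latency. */
double sum_unrolled (size_t n, const double *a) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)        /* cleanup when n isn't a multiple of 4 */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}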
And BTW, alignment isn't usually a huge deal for AVX on Haswell. But 64-byte alignment does matter a lot more in practice for AVX-512 on SKX: maybe a 20% slowdown for misaligned data, instead of a couple of percent.
(But promising alignment at compile time is a separate issue from actually having your data aligned at runtime. Both are helpful, but promising alignment at compile time makes tighter code with GCC 7 and earlier, or on any compiler without AVX.)
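To make that distinction concrete, a minimal sketch (aligned_alloc is C11; __builtin_assume_aligned is a GCC/clang builtin; the demo function is hypothetical):

#include <stdlib.h>

double sum_aligned_demo (size_t n) {
    /* Runtime side: the storage really is 64-byte aligned.  Strict C11
     * requires the size passed to aligned_alloc to be a multiple of the
     * alignment, hence the round-up. */
    double *A = aligned_alloc(64, ((n * sizeof(double) + 63) / 64) * 64);
    if (!A) return 0.0;
    for (size_t i = 0; i < n; i++) A[i] = 1.0;

    /* Compile-time side: promise that alignment to the optimizer, as
     * the aligned() clause in the #pragma omp simd above also does. */
    double *a = __builtin_assume_aligned(A, 64);
    double s = 0.0;
    for (size_t i = 0; i < n; i++) s += a[i];
    free(A);
    return s;
}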