256-bit vectorization via OpenMP SIMD prevents compiler optimizations (say, function inlining)?

Consider the following toy example, where A is an n x 2 matrix stored in column-major order and I want to compute its column sums. sum_0 only computes the sum of the 1st column, while sum_1 does the 2nd column as well. This is really an artificial example, as there is essentially no need to define two functions for this task (I could write a single function with a double loop nest where the outer loop iterates from 0 to j). It is constructed to demonstrate the template problem I have in reality.

/* "test.c" */
#include <stdlib.h>

// j can be 0 or 1
static inline void sum_template (size_t j, size_t n, double *A, double *c) {

  if (n == 0) return;
  size_t i;
  double *a = A, *b = A + n;   // a: 1st column, b: 2nd column
  double c0 = 0.0, c1 = 0.0;

  #pragma omp simd reduction (+: c0, c1) aligned (a, b: 32)
  for (i = 0; i < n; i++) {
    c0 += a[i];
    if (j > 0) c1 += b[i];
    }

  c[0] = c0;
  if (j > 0) c[1] = c1;

  }

#define macro_define_sum(FUN, j)            \
void FUN (size_t n, double *A, double *c) { \
  sum_template(j, n, A, c);                 \
  }

macro_define_sum(sum_0, 0)
macro_define_sum(sum_1, 1)

If I compile it with

gcc -O2 -mavx test.c

GCC (say the latest 8.2), after inlining, constant propagation, and dead code elimination, optimizes out the code involving c1 for function sum_0 (Check it on Godbolt).

I like this trick. By writing a single template function and passing in different configuration parameters, an optimizing compiler can generate different versions. It is much cleaner than copying and pasting a big proportion of the code and manually defining different function versions.

However, such convenience is lost if I activate OpenMP 4.0+ with

gcc -O2 -mavx -fopenmp test.c

sum_template is no longer inlined and no dead code elimination is applied (Check it on Godbolt). But if I remove the -mavx flag to work with 128-bit SIMD, compiler optimization works as I expect (Check it on Godbolt). So is this a bug? I am on x86-64 (Sandy Bridge).


Remark

Using GCC's auto-vectorization with -ftree-vectorize -ffast-math does not have this issue (Check it on Godbolt). But I wish to use OpenMP because it provides a portable alignment pragma across different compilers.
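
For reference, that alternative amounts to compiling the same source without OpenMP, e.g.:

gcc -O2 -mavx -ftree-vectorize -ffast-math test.c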

Background

I write modules for an R package, which needs to be portable across platforms and compilers. Writing an R extension requires no Makefile. When R is built on a platform, it knows what the default compiler on that platform is and configures a set of default compilation flags. R does not have an auto-vectorization flag, but it does have an OpenMP flag. This means that using OpenMP SIMD is the ideal way to utilize SIMD in an R package. See 1 and 2 for a bit more elaboration.

asked Jul 03 '18 by Zheyuan Li

1 Answer

The simplest way to solve this problem is with __attribute__((always_inline)), or other compiler-specific overrides.

#ifdef __GNUC__
#define ALWAYS_INLINE __attribute__((always_inline)) inline
#elif defined(_MSC_VER)
#define ALWAYS_INLINE __forceinline inline
#else
#define ALWAYS_INLINE  inline  // cross your fingers
#endif


ALWAYS_INLINE
static void sum_template (size_t j, size_t n, double *A, double *c) {
  ...
}

Godbolt proof that it works.

Also, don't forget to use -mtune=haswell, not just -mavx. It's usually a good idea. (However, promising aligned data will stop gcc's default -mavx256-split-unaligned-load tuning from splitting 256-bit loads into 128-bit vmovupd + vinsertf128, so code gen for this function is fine even without tune=haswell. But normally you want it so that gcc auto-vectorizes any other functions well.)
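
For this test case, the full invocation might then look like (the question's compile line plus the tuning flag):

gcc -O2 -mavx -mtune=haswell -fopenmp test.c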

You don't really need static along with inline; if a compiler decides not to inline it, it can at least share the same definition across compilation units.


Normally gcc decides to inline or not according to function-size heuristics. But even setting -finline-limit=90000 doesn't get gcc to inline with your #pragma omp (How do I force gcc to inline a function?). I had been guessing that gcc didn't realize that constant-propagation after inlining would simplify the conditional, but 90000 "pseudo-instructions" seems plenty big. There could be other heuristics.

Possibly OpenMP sets some per-function state differently, in ways that could break the optimizer if those functions were inlined into other functions. Using __attribute__((target("avx"))) stops a function from inlining into functions compiled without AVX (so you can do runtime dispatching safely, without inlining "infecting" other functions with AVX instructions across if(avx) conditions).
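
As a rough sketch of that dispatch pattern (the wrapper names below are made up for illustration; __builtin_cpu_supports is a GCC builtin), reusing sum_template from the question and assuming the file itself is compiled without -mavx as the baseline:

__attribute__((target("avx")))
static void sum_1_avx (size_t n, double *A, double *c) {
  sum_template(1, n, A, c);     // body compiled (and inlined) with AVX enabled
}

static void sum_1_sse (size_t n, double *A, double *c) {
  sum_template(1, n, A, c);     // baseline 128-bit build of the same template
}

void sum_1_dispatch (size_t n, double *A, double *c) {
  if (__builtin_cpu_supports("avx"))
    sum_1_avx(n, A, c);         // AVX code stays behind the runtime check
  else
    sum_1_sse(n, A, c);
}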

One thing OpenMP does that you don't get with regular auto-vectorization is that reductions can be vectorized without enabling -ffast-math.

Unfortunately OpenMP still doesn't bother to unroll with multiple accumulators or anything to hide FP latency. #pragma omp is a pretty good hint that a loop is actually hot and worth spending code-size on, so gcc should really do that, even without -fprofile-use.

So especially if this ever runs on data that's hot in L2 or L1 cache (or maybe L3), you should do something to get better throughput.
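
For example, a hedged sketch of what that could look like, unrolling by hand with two independent accumulators per column so the reduction isn't serialized on one FP-add dependency chain (this rewrite and its names are mine, not code from the question, and untested as a drop-in replacement):

#include <stddef.h>

static inline void sum_template_unrolled (size_t j, size_t n, double *A, double *c) {
  if (n == 0) return;
  size_t i;
  double *a = A, *b = A + n;
  double c00 = 0.0, c01 = 0.0, c10 = 0.0, c11 = 0.0;

  // Two partial sums per column: two vector accumulators hide FP-add latency.
  #pragma omp simd reduction (+: c00, c01, c10, c11) aligned (a, b: 32)
  for (i = 0; i < n - 1; i += 2) {
    c00 += a[i];
    c01 += a[i + 1];
    if (j > 0) { c10 += b[i]; c11 += b[i + 1]; }
  }
  if (n & 1) {                  // scalar cleanup for an odd trailing element
    c00 += a[n - 1];
    if (j > 0) c10 += b[n - 1];
  }

  c[0] = c00 + c01;
  if (j > 0) c[1] = c10 + c11;
}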

And BTW, alignment isn't usually a huge deal for AVX on Haswell. But 64-byte alignment does matter a lot more in practice for AVX512 on SKX. Like maybe 20% slowdown for misaligned data, instead of a couple %.

(But promising alignment at compile time is a separate issue from actually having your data aligned at runtime. Both are helpful, but promising alignment at compile time makes tighter code with gcc7 and earlier, or on any compiler without AVX.)
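
If you control the allocation, getting that runtime alignment is straightforward; here's a minimal sketch assuming C11's aligned_alloc is available (the helper name alloc_col_major is made up):

#include <stdlib.h>

// Allocate a 64-byte-aligned buffer for an n x ncol column-major matrix.
// C11 aligned_alloc expects the size to be a multiple of the alignment,
// so round up; release the buffer with free() as usual.
double *alloc_col_major (size_t n, size_t ncol) {
  size_t bytes = n * ncol * sizeof(double);
  bytes = (bytes + 63) & ~(size_t)63;   // round up to a multiple of 64
  return aligned_alloc(64, bytes);
}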

answered Oct 20 '22 by Peter Cordes