I have a function defined as
inline void vec_add(__m512d &v3, const __m512d &v1, const __m512d &v2) {
    v3 = _mm512_add_pd(v1, v2);
}
(__m512d is a native data type that maps to SIMD registers on the Intel MIC architecture.)
As this function is short and invoked frequently, I'd like it to be inlined at every call site. But Intel's compiler seems reluctant to inline it, even when I use the -inline-forceinline and -O3 options. It reports 'Forceinline not honored for call ...' while compiling. Since I have to use some compiler-specific features, e.g. the __m512d type, the Intel compiler is my only option.
More Info:
The file structure is quite simple. The function vec_add is defined in a header file, mic.h, which is included in another file, test.cc. vec_add is just invoked repeatedly in a loop, and there are no function pointers involved. A simplified version of the code in test.cc looks like this:
for (int i = 0; i < LENGTH; i += 8) {
    // a, b, c are arrays of doubles, and each SIMD register can hold 8 doubles
    __m512d va = _mm512_load_pd(a + i); // load SIMD register from memory
    __m512d vb = _mm512_load_pd(b + i); // ditto
    __m512d vc;
    vec_add(vc, va, vb);                // vc = va + vb
    _mm512_store_pd(c + i, vc);         // store SIMD register to memory
}
I've tried all kinds of hints, such as __attribute__((always_inline)), __forceinline, and the compiler option -inline-forceinline, none of which has worked so far.
Complete code
I've put all the relevant code together in a simplified form. You can try it out if you have an Intel compiler. Use the option -Winline to view inline reports and -inline-forceinline to force inlining.
#include <stdio.h>
#include <stdlib.h>
#include <immintrin.h>

#define LEN (1<<20)

__attribute((target(mic)))
inline void vec_add(__m512d &v3, const __m512d &v1, const __m512d &v2) {
    v3 = _mm512_add_pd(v1, v2);
}

int main() {
    #pragma offload target(mic)
    {
        double *a = (double*)_mm_malloc(LEN*sizeof(double), 64);
        double *b = (double*)_mm_malloc(LEN*sizeof(double), 64);
        double *c = (double*)_mm_malloc(LEN*sizeof(double), 64);

        for (int i = 0; i < LEN; i++) {
            a[i] = (double)rand()/RAND_MAX;
            b[i] = (double)rand()/RAND_MAX;
        }

        for (int i = 0; i < LEN; i += 8) {
            __m512d va = _mm512_load_pd(a + i);
            __m512d vb = _mm512_load_pd(b + i);
            __m512d vc;
            vec_add(vc, va, vb);
            _mm512_store_pd(c + i, vc);
        }

        _mm_free(a);
        _mm_free(b);
        _mm_free(c);
    }
}
Configurations
-O3 -inline-forceinline -Winline
Do you have any idea why this function can't be inlined, and how I can get it inlined after all? (I don't want to resort to macros.)
For some reason the Intel compiler doesn't inline functions that are called directly inside offloaded code (I'm not all that familiar with offloading, so I don't know the technical reason for this). See effective-use-of-the-intel-compilers-offload-features for more information (just search for "inline").
Quoting from the linked article:
Function Inlining into Offload Constructs
Sometimes inlining a function is necessary for optimum performance of the generated code. Functions called directly within a #pragma offload are not inlined by the compiler even if they are marked as inline. To enable optimum performance of code in offload regions, either manually inline functions, or place the entire offload construct into its own function.
...
One solution is to manually inline function f, as shown in function v2.
Another solution is to move the offload construct into its own function as shown in function v3.
If I understand this correctly, the best thing for you to do would be to move the loops into a separate function that is also marked with __attribute((target(mic))), and call that function from the offload region.
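To make the restructuring concrete, here is a rough sketch of what I mean. I have not been able to verify it on a MIC machine, so treat it as an illustration of the layout rather than a confirmed fix. Several things in it are my own assumptions and not part of the question: Vec8 is a hypothetical stand-in for __m512d so that the sketch compiles with any C++ compiler, add_arrays and run_offloaded are made-up names, and the MIC_TARGET macro with its __INTEL_OFFLOAD guard is only there to keep the code portable. On the real target you would use __m512d with _mm512_load_pd, _mm512_add_pd and _mm512_store_pd exactly as in the question.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: only apply the MIC attribute when the Intel offload
// compiler is in use, so the sketch also builds elsewhere.
#if defined(__INTEL_OFFLOAD)
#define MIC_TARGET __attribute__((target(mic)))
#else
#define MIC_TARGET
#endif

// Hypothetical stand-in for __m512d (8 doubles per register).
struct Vec8 { double lane[8]; };

// Same shape as vec_add in the question; the body would be
// v3 = _mm512_add_pd(v1, v2) on the real hardware.
MIC_TARGET
inline void vec_add(Vec8 &v3, const Vec8 &v1, const Vec8 &v2) {
    for (int k = 0; k < 8; ++k) v3.lane[k] = v1.lane[k] + v2.lane[k];
}

// The loop now lives in its own target(mic) function, so vec_add is
// called from here and no longer directly inside the offload construct.
MIC_TARGET
void add_arrays(double *c, const double *a, const double *b, std::size_t n) {
    assert(n % 8 == 0);  // each step consumes 8 doubles
    for (std::size_t i = 0; i < n; i += 8) {
        Vec8 va, vb, vc;
        for (int k = 0; k < 8; ++k) {  // stands in for _mm512_load_pd
            va.lane[k] = a[i + k];
            vb.lane[k] = b[i + k];
        }
        vec_add(vc, va, vb);
        for (int k = 0; k < 8; ++k)    // stands in for _mm512_store_pd
            c[i + k] = vc.lane[k];
    }
}

// The offload region only allocates, calls add_arrays, and checks the result.
bool run_offloaded(std::size_t n) {
    int ok = 1;
#if defined(__INTEL_OFFLOAD)
    #pragma offload target(mic)
#endif
    {
        double *a = new double[n];
        double *b = new double[n];
        double *c = new double[n];
        for (std::size_t i = 0; i < n; ++i) {
            a[i] = static_cast<double>(i);
            b[i] = 2.0 * static_cast<double>(i);
        }
        add_arrays(c, a, b, n);
        for (std::size_t i = 0; i < n; ++i)
            if (c[i] != 3.0 * static_cast<double>(i)) ok = 0;
        delete[] a;
        delete[] b;
        delete[] c;
    }
    return ok != 0;
}
```

Whether the compiler then actually inlines vec_add into add_arrays is something you would still need to confirm with the inline report (-Winline), since I am only going by the quoted article here.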