I wrote some code with static arrays and it vectorizes just fine.
float data[1024] __attribute__((aligned(16)));
I would like to make the arrays dynamically allocated. I tried doing something like this:
float *data = (float*) aligned_alloc(16, size*sizeof(float));
But the compiler (GCC 4.9.2) can no longer vectorize the code. I assume this is because it doesn't know that the pointer data is 16-byte aligned. I am getting messages like:
note: Unknown alignment for access: *_43
I have tried adding this line before the data is used, but it doesn't seem to do anything:
data = (float*) __builtin_assume_aligned(data, 16);
Using a different variable and restrict did not help:
float* __restrict__ align_data = (float*) __builtin_assume_aligned(data,16);
Example:
#include <iostream>
#include <stdlib.h>
#include <math.h>

#define SIZE 1024
#define DYNAMIC 0
#define A16 __attribute__((aligned(16)))
#define DA16 (float*) aligned_alloc(16, size*sizeof(float))

class Test{
public:
    int size;
#if DYNAMIC
    float *pos;
    float *vel;
    float *alpha;
    float *k_inv;
    float *osc_sin;
    float *osc_cos;
    float *dosc1;
    float *dosc2;
#else
    float pos[SIZE] A16;
    float vel[SIZE] A16;
    float alpha[SIZE] A16;
    float k_inv[SIZE] A16;
    float osc_sin[SIZE] A16;
    float osc_cos[SIZE] A16;
    float dosc1[SIZE] A16;
    float dosc2[SIZE] A16;
#endif
    Test(int arr_size){
        size = arr_size;
#if DYNAMIC
        pos = DA16;
        vel = DA16;
        alpha = DA16;
        k_inv = DA16;
        osc_sin = DA16;
        osc_cos = DA16;
        dosc1 = DA16;
        dosc2 = DA16;
#endif
    }
    void compute(){
        for (int i=0; i<size; i++){
            float lambda = .67891*k_inv[i],
                  omega = (.89 - 2*alpha[i]*lambda)*k_inv[i],
                  diff2 = pos[i] - omega,
                  diff1 = vel[i] - lambda + alpha[i]*diff2;
            pos[i] = osc_sin[i]*diff1 + osc_cos[i]*diff2 + lambda*.008 + omega;
            vel[i] = dosc1[i]*diff1 - dosc2[i]*diff2 + lambda;
        }
    }
};

int main(int argc, char** argv){
    Test t(SIZE);
    t.compute();
    std::cout << t.pos[10] << std::endl;
    std::cout << t.vel[10] << std::endl;
}
Here is how I am compiling:
g++ -o test test.cpp -O3 -march=native -ffast-math -fopt-info-optimized
When DYNAMIC is set to 0, it outputs:
test.cpp:46:4: note: loop vectorized
but when it is set to 1, it outputs nothing.
Each byte is 8 bits, so to align on a 16-bit boundary you align to each set of two bytes; aligning on a 16-byte boundary means the address is a multiple of 16. Similarly, memory aligned on a 32-bit (4-byte) boundary has an address that is a multiple of four, because four bytes are grouped together to form a 32-bit word.
The GNU documentation states that malloc is aligned to 16 byte multiples on 64 bit systems.
General Byte Alignment Rules
Structures between 5 and 8 bytes of data should be padded so that the total structure is 8 bytes. Structures between 9 and 16 bytes of data should be padded so that the total structure is 16 bytes. Structures greater than 16 bytes should be padded to a 16-byte boundary.
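As a rough illustration of the padding rules above, here is a minimal sketch that uses the standard C++11 alignas specifier. The struct names are hypothetical, and the language does not apply these rules on its own; alignas is just one way to request them, and the compiler then rounds each size up to a multiple of the requested alignment.

```cpp
#include <cstddef>

// Hypothetical example types; none of these names come from the question.

// 5 bytes of data, padded to 8 because of alignas(8).
struct alignas(8) Small { char bytes[5]; };

// 12 bytes of data, padded to 16 because of alignas(16).
struct alignas(16) Medium { float x, y, z; };

// 20 bytes of data, padded to the next 16-byte boundary (32 bytes).
struct alignas(16) Large { float v[5]; };

// sizeof is always a multiple of alignof, which is what produces the padding.
static_assert(sizeof(Small) == 8 && alignof(Small) == 8, "Small is padded to 8 bytes");
static_assert(sizeof(Medium) == 16 && alignof(Medium) == 16, "Medium is padded to 16 bytes");
static_assert(sizeof(Large) == 32 && alignof(Large) == 16, "Large is padded to 32 bytes");
```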
An object that is "8 bytes aligned" is stored at a memory address that is a multiple of 8. Many CPUs will only load some data types from aligned locations; on other CPUs such access is just faster. There are also several other possible reasons for using memory alignment; without seeing the code it's hard to say why.
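To make "stored at an address that is a multiple of N" concrete, here is a small sketch that checks a pointer's alignment at run time. The helper names is_aligned and alloc_floats_16 are made up for illustration. The sketch assumes a platform where aligned_alloc is available from stdlib.h, as in the question.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>

// Hypothetical helper: true when the address in p is a multiple of `alignment` bytes.
inline bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Hypothetical helper: allocates room for n floats on a 16-byte boundary.
// The caller releases the buffer with free().
inline float* alloc_floats_16(std::size_t n) {
    // C11 asks for the size to be a multiple of the alignment, so round up.
    std::size_t bytes = (n * sizeof(float) + 15) / 16 * 16;
    float* p = static_cast<float*>(aligned_alloc(16, bytes));
    // A successful allocation must start on a 16-byte boundary.
    assert(p == nullptr || is_aligned(p, 16));
    return p;
}
```

Note that a check like this only confirms the alignment while the program runs; it does not tell the compiler anything at compile time.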
The compiler isn't vectorizing the loop because it can't determine that the dynamically allocated pointers don't alias each other. A simple way to allow your sample code to be vectorized is to pass the --param vect-max-version-for-alias-checks=1000 option. This allows the compiler to emit all the run-time checks necessary to see whether the pointers actually alias.
Another simple solution that allows your example code to be vectorized is to rename main, as suggested by Marc Glisse in his comment. Functions named main apparently have certain optimizations disabled. When the function is named something else, GCC 4.9.2 can track the use of this->foo (and the other pointer members) in compute back to their allocations in Test().
However, I assume that something other than your class being used in a function named main prevented vectorization in your real code. A more general solution that allows your code to be vectorized without aliasing or alignment checks is to use the restrict keyword and the aligned attribute. Something like this:
typedef float __attribute__((aligned(16))) float_a16;

__attribute__((noinline))
static void _compute(float_a16 * __restrict__ pos,
                     float_a16 * __restrict__ vel,
                     float_a16 * __restrict__ alpha,
                     float_a16 * __restrict__ k_inv,
                     float_a16 * __restrict__ osc_sin,
                     float_a16 * __restrict__ osc_cos,
                     float_a16 * __restrict__ dosc1,
                     float_a16 * __restrict__ dosc2,
                     int size) {
    for (int i=0; i<size; i++){
        float lambda = .67891*k_inv[i],
              omega = (.89 - 2*alpha[i]*lambda)*k_inv[i],
              diff2 = pos[i] - omega,
              diff1 = vel[i] - lambda + alpha[i]*diff2;
        pos[i] = osc_sin[i]*diff1 + osc_cos[i]*diff2 + lambda*.008 + omega;
        vel[i] = dosc1[i]*diff1 - dosc2[i]*diff2 + lambda;
    }
}

void compute() {
    _compute(pos, vel, alpha, k_inv, osc_sin, osc_cos, dosc1, dosc2,
             size);
}
The noinline attribute is critical; otherwise inlining can cause the pointers to lose their restrictedness and alignedness. The compiler seems to ignore the restrict keyword in contexts other than function parameters.
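For reference, here is a cut-down, self-contained sketch of the same pattern with only two arrays. The names Buffers, scale_add and alloc_a16 are hypothetical and the kernel is a stand-in for the real loop. It assumes GCC or a compatible compiler (for the attributes and __restrict__) and a platform that provides aligned_alloc. Whether the loop is actually vectorized depends on the compiler version and flags, so check the -fopt-info-optimized output rather than taking it for granted.

```cpp
#include <cassert>
#include <cstddef>
#include <stdlib.h>

// Same typedef as above: a float whose pointers are assumed 16-byte aligned.
typedef float __attribute__((aligned(16))) float_a16;

// Hypothetical stand-in kernel: out[i] += k * in[i].
// noinline keeps the restrict and alignment information attached to the parameters.
__attribute__((noinline))
static void scale_add(float_a16* __restrict__ out,
                      const float_a16* __restrict__ in,
                      float k, int size) {
    for (int i = 0; i < size; i++) {
        out[i] = out[i] + k * in[i];
    }
}

// Hypothetical helper: 16-byte aligned storage for n floats, size rounded up to a multiple of 16.
static float* alloc_a16(int n) {
    std::size_t bytes = (static_cast<std::size_t>(n) * sizeof(float) + 15) / 16 * 16;
    return static_cast<float*>(aligned_alloc(16, bytes));
}

// Hypothetical owner class that mirrors the dynamic variant of Test.
struct Buffers {
    int size;
    float* out;
    float* in;

    explicit Buffers(int n) : size(n), out(alloc_a16(n)), in(alloc_a16(n)) {
        assert(out != nullptr && in != nullptr);
    }
    ~Buffers() { free(out); free(in); }

    // The raw pointers are owned by this object, so copying is disabled.
    Buffers(const Buffers&) = delete;
    Buffers& operator=(const Buffers&) = delete;

    // Thin wrapper: the members are handed to the kernel as restrict parameters.
    void compute(float k) { scale_add(out, in, k, size); }
};
```

The promise made by the typedef and by __restrict__ is the caller's responsibility: the buffers must really be 16-byte aligned and must not overlap, otherwise the behavior is undefined.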