Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to tell gcc that the data pointed to by a pointer will always be aligned?

Tags:

c

memory

gcc

avx

In my program (written in plain C) I have a structure which holds data prepared to be transformed by vectorized (AVX only) radix-2 2D fast fourier transform. The structure looks like this:

struct data {
    double complex *data;
    unsigned int width;
    unsigned int height;
    unsigned int stride;
};

Now I need to load data from memory as fast as possible. As far as I know there exists unaligned and aligned load to ymm registers (vmovapd and vmovupd instructions) and I would like the program to use the aligned version as its faster.

So far I use roughly similar construction for all operations over the array. This example is part of program when data and filter are both already transformed to frequency domain and the filter is applied to data by element by element multiplication.

union m256d {
    __m256d reg;
    double d[4];

};

struct data *data, *filter;
/* Load data and filter here, both have the same width, height and stride. */

unsigned int stride = data->stride;
for(unsigned int i = 0; i<data->height; i++) {
    for(unsigned int j = 0; j<data->width; j+=4) {
        union m256d a[2];
        union m256d b[2];
        union m256d r[2];

        memcpy(a, &(  data->data[i*stride+j]), 2*sizeof(*a));
        memcpy(b, &(filter->data[i*stride+j]), 2*sizeof(*b));

        r[0].reg = _mm256_mul_pd(a[0].reg, b[0].reg);
        r[1].reg = _mm256_mul_pd(a[1].reg, b[1].reg);

        memcpy(&(data->data[i*stride+j]), r, 2*sizeof(*r));
    }
}

As expected memcpy calls are optimized. However after observation gcc translates memcpy either to two vmovupd instructions or to bunch of movq instructions which load data to guaranteedly aligned place on stack and then two vmovapd instructions which load it to ymm registers. This behavior depends whether the memcpy prototype is defined or not (if it is defined then gcc uses movq and vmovapd).

I am able to ensure that the data in memory are aligned but I am not sure how to tell gcc that it can just use movapd instructions to load data from memory straight to ymm registers. I strongly suspect that gcc does not know the fact that data pointed by &(data->data[i*stride+j]) are always aligned.

Is there any option how to tell gcc that the data pointed to by a pointer will always be aligned?

like image 209
Kostrahb Avatar asked Sep 14 '17 21:09

Kostrahb


People also ask

How do I know if my address is 16 byte aligned?

If the address is 16 byte aligned, these must be zero. Notice the lower 4 bits are always 0. The cryptic if statement now becomes very clear and intuitive. We simply mask the upper portion of the address, and check if the lower 4 bits are zero.

What is pointer alignment?

Aligned Pointer means that pointer with adjacent memory location that can be accessed by a adding a constant and its multiples. for char a[5] = "12345"; here a is constant pointer if you and the size of char to it every time you can access the next chracter that is, a +sizeofchar will access 2.

What is a 4 byte aligned address?

For instance, if the address of a data is 12FEECh (1244908 in decimal), then it is 4-byte alignment because the address can be evenly divisible by 4. (You can divide it by 2 or 1, but 4 is the highest number that is divisible evenly.)


1 Answers

vmovupd is exactly as fast as vmovapd when the data is in fact aligned at runtime. The only difference is that vmovapd faults when the data isn't aligned. (See optimization links in the x86 tag wiki, especially Agner Fog's optimization and microarch pdfs, and Intel's optimization manual.

You only have a problem if it ever uses multiple instructions instead of one.


Since you're using Intel intrinsics for _mm256_mul_pd, use load/store intrinsics instead of memcpy! See the sse tag wiki for intrinsics guides and more.

// Hoist this outside the loop,
// mostly for readability; should optimize fine either way.
// Probably only aliasing-safe to use these pointers with _mm256_load/store (which alias anything)
// unless C allows `double*` to alias `double complex*`
const double *flat_filt = (const double*)filter->data;
      double *flat_data =       (double*)data->data;

for (...) {
    //union m256d a[2];
    //union m256d b[2];
    //union m256d r[2];

       //memcpy(a, &(  data->data[i*stride+j]), 2*sizeof(*a));
    __m256d a0 = _mm256_load_pd(0 + &flat_data[i*stride+j]);
    __m256d a1 = _mm256_load_pd(4 + &flat_data[i*stride+j]);
       //memcpy(b, &(filter->data[i*stride+j]), 2*sizeof(*b));
    __m256d b0 = _mm256_load_pd(0 + &flat_filt[i*stride+j]);
    __m256d b1 = _mm256_load_pd(4 + &flat_filt[i*stride+j]);
       // +4 doubles = +32 bytes = 1 YMM vector = +2 double complex

    __m256d r0 = _mm256_mul_pd(a0, b0);
    __m256d r1 = _mm256_mul_pd(a1, b1);

       // memcpy(&(data->data[i*stride+j]), r, 2*sizeof(*r));
    _mm256_store_pd(0 + &flat_data[i*stride+j], r0);
    _mm256_store_pd(4 + &flat_data[i*stride+j], r1);
}

If you wanted an unaligned load/store, you'd use _mm256_loadu_pd / storeu.

Or you could have just cast your double complex* to a __m256d* and dereferenced that directly. In GCC, that's equivalent to an aligned-load intrinsic. But the usual convention is to use load/store intrinsics.


To answer the title question, though, you can help gcc auto-vectorize by telling it when a pointer is guaranteed to be aligned:

data = __builtin_assume_aligned(data, 64);

In C++, you need to cast the result, but in C void* is freely castable.

See How to tell GCC that a pointer argument is always double-word-aligned? and https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html.

This is of course specific to GNU C/C++ dialects (clang, gcc, icc), not portable to MSVC or other compilers that don't support GNU extensions.


So far I use roughly similar construction for all operations over the array.

Looping over the array multiple times is usually worse than doing as much as possible in a single pass. Even if it all stays hot in L1D, just the extra load and store instructions are a bottleneck compared to doing more while your data is in registers.

like image 112
Peter Cordes Avatar answered Nov 01 '22 16:11

Peter Cordes