
How to tell gcc that the data pointed to by a pointer will always be aligned?

Tags: c, memory, gcc, avx

In my program (written in plain C) I have a structure which holds data prepared to be transformed by a vectorized (AVX only) radix-2 2D fast Fourier transform. The structure looks like this:

struct data {
    double complex *data;
    unsigned int width;
    unsigned int height;
    unsigned int stride;
};

Now I need to load data from memory as fast as possible. As far as I know there are aligned and unaligned loads to ymm registers (the vmovapd and vmovupd instructions, respectively), and I would like the program to use the aligned version as it's faster.

So far I use a roughly similar construction for all operations over the array. This example is the part of the program where data and filter have both already been transformed to the frequency domain, and the filter is applied to the data by element-by-element multiplication.

union m256d {
    __m256d reg;
    double d[4];
};

struct data *data, *filter;
/* Load data and filter here, both have the same width, height and stride. */

unsigned int stride = data->stride;
for(unsigned int i = 0; i<data->height; i++) {
    for(unsigned int j = 0; j<data->width; j+=4) {
        union m256d a[2];
        union m256d b[2];
        union m256d r[2];

        memcpy(a, &(  data->data[i*stride+j]), 2*sizeof(*a));
        memcpy(b, &(filter->data[i*stride+j]), 2*sizeof(*b));

        r[0].reg = _mm256_mul_pd(a[0].reg, b[0].reg);
        r[1].reg = _mm256_mul_pd(a[1].reg, b[1].reg);

        memcpy(&(data->data[i*stride+j]), r, 2*sizeof(*r));
    }
}

As expected, the memcpy calls are optimized away. However, looking at the output, gcc translates each memcpy either to two vmovupd instructions, or to a bunch of movq instructions which copy the data to a location on the stack that is guaranteed to be aligned, followed by two vmovapd instructions which load it into ymm registers. Which one I get depends on whether the memcpy prototype is declared or not (if it is declared, gcc uses movq and vmovapd).

I am able to ensure that the data in memory is aligned, but I am not sure how to tell gcc that it can just use vmovapd instructions to load data from memory straight into ymm registers. I strongly suspect that gcc does not know that the data pointed to by &(data->data[i*stride+j]) is always aligned.

Is there any way to tell gcc that the data pointed to by a pointer will always be aligned?


asked Sep 14 '17 21:09

Kostrahb

1 Answer

vmovupd is exactly as fast as vmovapd when the data is in fact aligned at runtime. The only difference is that vmovapd faults when the data isn't aligned. (See the optimization links in the x86 tag wiki, especially Agner Fog's optimization and microarch pdfs, and Intel's optimization manual.)

You only have a problem if the compiler ever uses multiple instructions for a load or store instead of one.
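Since an aligned load faults on misaligned data, it can be worth checking the alignment assumption in debug builds. The following is only a sketch: the helper names is_aligned and rows_are_32_aligned are hypothetical, and it assumes 32-byte alignment (one YMM register) and the double complex layout from the question.

```c
#include <assert.h>
#include <complex.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if p is a multiple of `align` bytes (align must be a power of two). */
bool is_aligned(const void *p, size_t align)
{
    return ((uintptr_t)p & (align - 1)) == 0;
}

/* Hypothetical check: every row start &base[i*stride] is 32-byte aligned
   if the base pointer is aligned and one row occupies a multiple of 32 bytes. */
bool rows_are_32_aligned(const double complex *base, unsigned int stride)
{
    return is_aligned(base, 32)
        && (stride * sizeof(double complex)) % 32 == 0;
}
```

With sizeof(double complex) == 16, this means the stride has to be even, and the column index j has to be a multiple of 2 as well for &data->data[i*stride+j] to stay on a 32-byte boundary.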

Since you're using Intel intrinsics for _mm256_mul_pd, use load/store intrinsics instead of memcpy! See the sse tag wiki for intrinsics guides and more.

// Hoist this outside the loop,
// mostly for readability; should optimize fine either way.
// Probably only aliasing-safe to use these pointers with _mm256_load/store (which alias anything)
// unless C allows `double*` to alias `double complex*`
const double *flat_filt = (const double*)filter->data;
      double *flat_data =       (double*)data->data;

for (...) {
    // flat_* are indexed in doubles, and each double complex is 2 doubles,
    // so the element index i*stride+j has to be doubled.
    size_t idx = 2 * ((size_t)i*stride + j);

       //memcpy(a, &(  data->data[i*stride+j]), 2*sizeof(*a));
    __m256d a0 = _mm256_load_pd(0 + &flat_data[idx]);
    __m256d a1 = _mm256_load_pd(4 + &flat_data[idx]);
       //memcpy(b, &(filter->data[i*stride+j]), 2*sizeof(*b));
    __m256d b0 = _mm256_load_pd(0 + &flat_filt[idx]);
    __m256d b1 = _mm256_load_pd(4 + &flat_filt[idx]);
       // +4 doubles = +32 bytes = 1 YMM vector = +2 double complex

    __m256d r0 = _mm256_mul_pd(a0, b0);
    __m256d r1 = _mm256_mul_pd(a1, b1);

       // memcpy(&(data->data[i*stride+j]), r, 2*sizeof(*r));
    _mm256_store_pd(0 + &flat_data[idx], r0);
    _mm256_store_pd(4 + &flat_data[idx], r1);
}

If you wanted an unaligned load/store, you'd use _mm256_loadu_pd / _mm256_storeu_pd.

Or you could have just cast your double complex* to a __m256d* and dereferenced that directly. In GCC, that's equivalent to an aligned-load intrinsic. But the usual convention is to use load/store intrinsics.

To answer the title question, though, you can help gcc auto-vectorize by telling it when a pointer is guaranteed to be aligned:

data = __builtin_assume_aligned(data, 64);

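The builtin returns void*, and the alignment promise applies to the returned pointer, so use the return value. In C++ you need to cast the result, but in C void* converts implicitly to any object pointer type.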

See "How to tell GCC that a pointer argument is always double-word-aligned?" and https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html.

This is of course specific to GNU C/C++ dialects (clang, gcc, icc), not portable to MSVC or other compilers that don't support GNU extensions.
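As a minimal sketch of how the builtin might be used (the helper names scale_aligned and alloc_doubles_32 are made up for illustration, and 32 bytes is assumed because that is the width of one YMM register), the loop below is plain C that gcc is free to auto-vectorize with aligned loads and stores:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: dst[k] *= src[k] for n doubles, promising the
   compiler that both pointers are 32-byte aligned. Behavior is undefined
   if that promise is false, hence the debug-build assert. */
void scale_aligned(double *dst, const double *src, size_t n)
{
    assert((uintptr_t)dst % 32 == 0 && (uintptr_t)src % 32 == 0);

    /* The alignment assumption is attached to the returned pointers,
       so the loop must go through d and s, not dst and src. */
    double *d = __builtin_assume_aligned(dst, 32);
    const double *s = __builtin_assume_aligned(src, 32);

    for (size_t k = 0; k < n; k++)
        d[k] *= s[k];
}

/* Hypothetical helper: allocate n doubles on a 32-byte boundary (C11). */
double *alloc_doubles_32(size_t n)
{
    size_t bytes = n * sizeof(double);
    bytes = (bytes + 31) / 32 * 32;  /* round up to a multiple of the alignment */
    return aligned_alloc(32, bytes);
}
```

Whether the compiler actually emits vmovapd here depends on the optimization level and target flags (for example -O3 -mavx), so inspect the generated assembly rather than taking it for granted.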

You wrote: "So far I use roughly similar construction for all operations over the array."

Looping over the array multiple times is usually worse than doing as much as possible in a single pass. Even if it all stays hot in L1D, just the extra load and store instructions are a bottleneck compared to doing more while your data is in registers.
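To illustrate the idea, here is a sketch that uses plain double arrays instead of double complex to keep it short. The second operation (scaling by a constant s, such as a normalization factor) is a hypothetical stand-in for whatever other pass your program makes over the array; both function names are made up.

```c
#include <assert.h>
#include <stddef.h>

/* Two passes: every element of x is loaded and stored twice. */
void filter_then_scale_two_pass(double *x, const double *f, double s, size_t n)
{
    for (size_t k = 0; k < n; k++)
        x[k] *= f[k];
    for (size_t k = 0; k < n; k++)
        x[k] *= s;
}

/* One pass: same arithmetic in the same order, but each element of x is
   loaded once, worked on while it is in a register, and stored once. */
void filter_then_scale_one_pass(double *x, const double *f, double s, size_t n)
{
    for (size_t k = 0; k < n; k++)
        x[k] = (x[k] * f[k]) * s;
}
```

The fused version does the same multiplications per element but halves the number of loads and stores on x, which is where the saving comes from.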


answered Nov 01 '22 16:11

Peter Cordes

