
Process unaligned part of a double array, vectorize the rest

I am generating SSE/AVX instructions and currently I have to use unaligned loads and stores. I operate on a float/double array and I will never know whether it will be aligned or not. So before vectorizing it, I would like to have a pre-loop and possibly a post-loop which take care of the unaligned part. The main vectorized loop then operates on the aligned part.

But how do I determine when an array is aligned? Can I check the pointer value? When should the pre-loop stop and the post-loop start?

Here is my simple code example:

void func(double * in, double * out, unsigned int size){
    for( as long as in unaligned part ){
        out[i] = do_something_with_array(in[i])
    }
    for( as long as aligned ){
        awesome avx code that loads operates and stores 4 doubles
    }
    for( remaining part of array ){
        out[i] = do_something_with_array(in[i])
    }
}

Edit: I have been thinking about it. Theoretically the pointer to the i-th element should be divisible (something like &a[i] % 16 == 0) by 16 or 32 bytes, depending on whether it is SSE or AVX. So the first loop should cover the elements whose addresses are not divisible.
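
For illustration, something like this is what I mean (the is_aligned helper is just a sketch, not code I already have):

#include <stdint.h>   /* uintptr_t */
#include <stddef.h>   /* size_t    */

/* non-zero if p is aligned to `align` bytes; align must be a power of two */
static int is_aligned(const void *p, size_t align) {
    return ((uintptr_t)p & (align - 1)) == 0;
}

/* pre-loop sketch: stop as soon as &in[i] is 32-byte aligned (AVX) */
/* while (i < size && !is_aligned(&in[i], 32)) { out[i] = do_something_with_array(in[i]); i++; } */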

Practically I will try out the compiler pragmas and flags to see what the compiler produces. If no one gives a good answer I will then post my solution (if any) over the weekend.

asked May 30 '16 by hr0m

1 Answer

Here is some example C code that does what you want:

#include <stdio.h>
#include <x86intrin.h>
#include <inttypes.h>

#define ALIGN 32
#define SIMD_WIDTH (ALIGN/sizeof(double))

int main(void) {
    int n = 17;
    int c = 1;
    double* p = _mm_malloc((n+c) * sizeof *p, ALIGN);
    double* p1 = p+c;                  /* deliberately misaligned by one double */
    for(int i=0; i<n; i++) p1[i] = 1.0*i;
    double* p2 = (double*)((uintptr_t)(p1+SIMD_WIDTH-1)&-ALIGN); /* p1 rounded up to 32 bytes     */
    double* p3 = (double*)((uintptr_t)(p1+n)&-ALIGN);            /* p1+n rounded down to 32 bytes */
    if(p2>p3) p2 = p3;                 /* array too small for any aligned block */

    printf("%p %p %p %p\n", p1, p2, p3, p1+n);
    double *t;
    for(t=p1; t<p2; t+=1) {            /* scalar loop over the unaligned head */
        printf("a %p %f\n", t, *t);
    }
    puts("");
    for(;t<p3; t+=SIMD_WIDTH) {        /* aligned loop, SIMD_WIDTH doubles at a time */
        printf("b %p ", t);
        for(int i=0; i<SIMD_WIDTH; i++) printf("%f ", *(t+i));
        puts("");
    }
    puts("");
    for(;t<p1+n; t+=1) {               /* scalar loop over the remaining tail */
        printf("c %p %f\n", t, *t);
    }
    _mm_free(p);
}

This generates a 32-byte aligned buffer and then offsets it by one double so it is no longer 32-byte aligned. It loops over scalar values until 32-byte alignment is reached, then loops over the 32-byte aligned values, and finally finishes with another scalar loop for any remaining values that do not fill a full SIMD width.


I would argue that this kind of optimization only really makes a lot of sense for Intel x86 processors before Nehalem. Since Nehalem the latency and throughput of unaligned loads and stores have been the same as for aligned loads and stores, and the cost of cache-line splits has been small.

There is one subtle point with SSE since Nehalem: unaligned loads and stores cannot be folded into other operations as memory operands. Therefore, aligned loads and stores are not obsolete with SSE even since Nehalem, so in principle this optimization could still make a difference; in practice I think there are few cases where it will.

However, with AVX unaligned loads and stores can be folded, so the aligned load and store instructions are effectively obsolete.
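
To make the folding point concrete, here is a rough intrinsics sketch (sum_sse and sum_avx are just made-up example functions, not part of the code above; the AVX one needs -mavx): with SSE only an aligned load can be folded into an instruction like addpd as a memory operand, while AVX's vaddpd also accepts unaligned memory operands.

#include <x86intrin.h>

/* SSE: only an aligned load can be folded into addpd as a memory operand */
double sum_sse(const double *p, int n) {           /* assumes p 16-byte aligned, n % 2 == 0 */
    __m128d acc = _mm_setzero_pd();
    for (int i = 0; i < n; i += 2)
        acc = _mm_add_pd(acc, _mm_load_pd(p + i)); /* aligned load, may fold into addpd */
    double tmp[2]; _mm_storeu_pd(tmp, acc);
    return tmp[0] + tmp[1];
}

/* AVX: vaddpd may take an unaligned memory operand, so the unaligned load can fold too */
double sum_avx(const double *p, int n) {           /* n % 4 == 0, no alignment assumed */
    __m256d acc = _mm256_setzero_pd();
    for (int i = 0; i < n; i += 4)
        acc = _mm256_add_pd(acc, _mm256_loadu_pd(p + i));
    double tmp[4]; _mm256_storeu_pd(tmp, acc);
    return tmp[0] + tmp[1] + tmp[2] + tmp[3];
}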


I looked into this with GCC, MSVC, and Clang. If GCC cannot assume a pointer is aligned to, e.g., 16 bytes with SSE, it will generate code similar to the code above to reach 16-byte alignment and avoid the cache-line splits when vectorizing.

Clang and MSVC don't do this, so they suffer from the cache-line splits. However, the cost of the additional code needed to do this roughly offsets the cost of the cache-line splits, which probably explains why Clang and MSVC don't worry about it.

The only exception is before Nehalem. In that case GCC is much faster than Clang and MSVC when the pointer is not aligned. If the pointer is aligned and Clang knows it, it will use aligned loads and stores and be as fast as GCC. MSVC's vectorization still uses unaligned loads and stores and is therefore slow pre-Nehalem even when a pointer is 16-byte aligned.
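
As an aside, one way to let GCC and Clang know that a pointer is aligned is __builtin_assume_aligned. A minimal sketch (the scale function and its body are just a made-up example):

void scale(double *in, double *out, int n) {
    /* tell GCC/Clang that in and out are 32-byte aligned; the compiler may then
       vectorize with aligned loads/stores and drop the runtime peeling code */
    double *ain  = __builtin_assume_aligned(in, 32);
    double *aout = __builtin_assume_aligned(out, 32);
    for (int i = 0; i < n; i++)
        aout[i] = 2.0 * ain[i];
}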


Here is a version which I think is a bit clearer, using pointer differences:

#include <stdio.h>
#include <x86intrin.h>
#include <inttypes.h>

#define ALIGN 32
#define SIMD_WIDTH (ALIGN/sizeof(double))

int main(void) {
    int n = 17, c = 1;

    double* p = _mm_malloc((n+c) * sizeof *p, ALIGN);
    double* p1 = p+c;                  /* deliberately misaligned by one double */
    for(int i=0; i<n; i++) p1[i] = 1.0*i;
    double* p2 = (double*)((uintptr_t)(p1+SIMD_WIDTH-1)&-ALIGN); /* p1 rounded up to 32 bytes     */
    double* p3 = (double*)((uintptr_t)(p1+n)&-ALIGN);            /* p1+n rounded down to 32 bytes */
    int n1 = p2-p1, n2 = p3-p2;        /* head length and aligned length in doubles */
    if(n1>n2) n1=n2;                   /* array too small for any aligned block */
    printf("%d %d %d\n", n1, n2, n);

    int i;
    for(i=0; i<n1; i++) {              /* scalar loop over the unaligned head */
        printf("a %p %f\n", &p1[i], p1[i]);
    }
    puts("");
    for(;i<n2; i+=SIMD_WIDTH) {        /* aligned loop over indices n1..n1+n2-1 (works because n1 < SIMD_WIDTH) */
        printf("b %p ", &p1[i]);
        for(int j=0; j<SIMD_WIDTH; j++) printf("%f ", p1[i+j]);
        puts("");
    }
    puts("");
    for(;i<n; i++) {                   /* scalar loop over the remaining tail */
        printf("c %p %f\n", &p1[i], p1[i]);
    }
    _mm_free(p);
}
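
To tie this back to the question's func skeleton, here is a rough sketch of how the three loops could look with actual AVX intrinsics (the multiply by 2.0 is just a placeholder operation, and the store to out is kept unaligned since only in is peeled to alignment); compile with -mavx:

#include <stdint.h>
#include <x86intrin.h>

#define ALIGN 32
#define SIMD_WIDTH (ALIGN/sizeof(double))

void func(double * in, double * out, unsigned int size){
    int n = (int)size;
    double *p1 = in;
    double *p2 = (double*)((uintptr_t)(p1+SIMD_WIDTH-1)&-ALIGN); /* in rounded up to 32 bytes   */
    double *p3 = (double*)((uintptr_t)(p1+n)&-ALIGN);            /* in+n rounded down to 32 bytes */
    if(p2 > p3) p2 = p3;                  /* array too small for any aligned block */
    int i = 0, n1 = p2-p1, n2 = p3-p1;

    for(; i < n1; i++)                    /* scalar pre-loop until &in[i] is 32-byte aligned */
        out[i] = 2.0*in[i];

    const __m256d two = _mm256_set1_pd(2.0);
    for(; i < n2; i += SIMD_WIDTH){       /* aligned AVX loop, 4 doubles per iteration */
        __m256d v = _mm256_load_pd(in+i); /* aligned load, since &in[i] is 32-byte aligned */
        _mm256_storeu_pd(out+i, _mm256_mul_pd(v, two)); /* out may be misaligned, store unaligned */
    }

    for(; i < n; i++)                     /* scalar post-loop for the remaining tail */
        out[i] = 2.0*in[i];
}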
answered Nov 19 '22 by Z boson