
Hint the compiler that float-vector count is divisible by 8?

static inline void R1_sub_R0(float *vec,  size_t cnt,  float toSubtract){
    for(size_t i=0; i<cnt; ++i){
        vec[i] -= toSubtract;
    }
}

I know that cnt will always be divisible by 8, so the loop could be vectorized with SSE or AVX. In other words, we can iterate over *vec as a __m256 type. But the compiler probably doesn't know this. How can I assure the compiler that the count is guaranteed to be divisible by 8?

Would something like this help if placed at the start of the function?

assert(((cnt*sizeof(float)) % sizeof(__m256)) ==0 );  //checks that it's "multiple of __m256 type".

Of course, I could simply write the whole thing as explicitly vectorized code:

static inline void R1_sub_R0(float *vec,  size_t cnt,  float toSubtract){
    assert(cnt*sizeof(float) % sizeof(__m256) == 0);//check that it's "multiple of __m256 type".
    assert(((uintptr_t)(const void *)(vec)) % sizeof(__m256) == 0);//assert that 'vec' is 32-byte aligned, which dereferencing a __m256* requires

    __m256 sToSubtract = _mm256_set1_ps(toSubtract);
    __m256 *sPtr = (__m256*)vec;
    const __m256 *sEnd = (const __m256*)(vec+cnt);

    for(;  sPtr != sEnd;  ++sPtr){
        *sPtr = _mm256_sub_ps(*sPtr, sToSubtract);
    }
}

However, it runs 10% slower than the original version. So I just want to give the compiler an extra bit of information, so that it can vectorize the original loop even more efficiently.

Kari asked Sep 26 '19 13:09


1 Answer

Hint the compiler that float-vector count is divisible by 8?

You could semi-unroll the loop by nesting another:

for(size_t i=0; i < cnt; i += 8){
    for(size_t j=0; j < 8; j++){
        vec[i + j] -= toSubtract;
    }
}

The compiler can easily see that the inner loop has a constant trip count of 8, so it can unroll it completely and potentially use SIMD if it so chooses.
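For completeness, here is a minimal sketch of the whole function with the nested loop in place. It assumes, as the question states, that cnt is always a multiple of 8; the assert only checks this in debug builds, and whether the compiler actually emits SIMD for the inner loop depends on the compiler and flags.

```cpp
#include <cassert>
#include <cstddef>

// Sketch: subtract a scalar from every element, walking the array in blocks of 8.
// Precondition (an assumption taken from the question): cnt % 8 == 0.
static inline void R1_sub_R0(float *vec, std::size_t cnt, float toSubtract) {
    assert(cnt % 8 == 0);  // debug-only check; compiled out when NDEBUG is defined
    for (std::size_t i = 0; i < cnt; i += 8) {
        // Constant trip count: the compiler is free to unroll this loop fully.
        for (std::size_t j = 0; j < 8; ++j) {
            vec[i + j] -= toSubtract;
        }
    }
}
```

Note that if cnt is not a multiple of 8, the last block runs past the end of the array, so the precondition really does have to hold.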

Hint the compiler that float-vector count is [16-byte aligned]?

This is quite a bit more tricky.

You could use something like:

struct alignas(16) sse {
    float arr[8];
};

// cnt is now the number of structs, i.e. one eighth of the original cnt
void R1_sub_R0(sse *vec,  size_t cnt,  float toSubtract) {
    for(size_t i=0; i < cnt; i ++){
        for(size_t j=0; j < 8; j++){
            vec[i].arr[j] -= toSubtract;
        }
    }
}
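
To show the caller's side of the struct approach, here is a small sketch. The names Block8 and R1_sub_R0_blocks are hypothetical, and alignas(32) is an assumption chosen to match the size of one __m256 rather than the 16 used above; the static_assert checks that the struct adds no padding, so an array of them is laid out like a plain float array.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical block type: 8 floats, aligned to 32 bytes (assumed, to match AVX).
struct alignas(32) Block8 {
    float arr[8];
};
static_assert(sizeof(Block8) == 8 * sizeof(float), "Block8 must not contain padding");

// blocks is the number of Block8 elements, i.e. the float count divided by 8.
static inline void R1_sub_R0_blocks(Block8 *vec, std::size_t blocks, float toSubtract) {
    for (std::size_t i = 0; i < blocks; ++i) {
        for (std::size_t j = 0; j < 8; ++j) {
            vec[i].arr[j] -= toSubtract;
        }
    }
}
```

Because the type carries both guarantees, the caller cannot pass a count that is not a multiple of 8 floats, and any Block8 array is aligned by construction.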

Other than that, there are compiler extensions such as GCC's and Clang's __builtin_assume_aligned that can be used with the plain float array.
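
A rough sketch of how such extensions might be applied to the plain float array, assuming GCC or Clang. The function name R1_sub_R0_hinted is hypothetical, and the 32-byte figure is an assumption matching one __m256 (use 16 if only SSE alignment is guaranteed). The sketch also uses __builtin_unreachable to state the divisibility promise, since a plain assert disappears when NDEBUG is defined and so cannot inform the optimizer in release builds. Neither hint forces vectorization; they only remove reasons for the compiler to emit peeling or remainder code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// GCC/Clang only: both builtins used below are compiler extensions, not standard C++.
static inline void R1_sub_R0_hinted(float *vec, std::size_t cnt, float toSubtract) {
    // Debug-build checks of the two promises made below.
    assert(cnt % 8 == 0);
    assert(reinterpret_cast<std::uintptr_t>(vec) % 32 == 0);

    // Promise that cnt is a multiple of 8; behavior is undefined if it is not.
    if (cnt % 8 != 0) __builtin_unreachable();

    // Promise 32-byte alignment; the hint applies to the returned pointer.
    float *p = static_cast<float *>(__builtin_assume_aligned(vec, 32));

    for (std::size_t i = 0; i < cnt; ++i) {
        p[i] -= toSubtract;
    }
}
```

Breaking either promise is undefined behavior in release builds, so these hints only make sense when the caller truly guarantees both conditions.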

eerorika answered Oct 01 '22 13:10