SSE and AVX intrinsics mixture

Question

In addition to SSE-copy, AVX-copy and std::copy performance. Suppose that we need to vectorize some loop in following manner: 1) vectorize first loop-batch (which is multiple by 8) via AVX. 2) split loop's remainder into two batches. Vectorize the batch that is a multiple of 4 via SSE. 3) Process residual batch of entire loop via serial routine. Let's consider example of copying arrays:

#include <immintrin.h>

template<int length,
         int unroll_bound_avx = length & (~7),
         int unroll_tail_avx  = length - unroll_bound_avx,
         int unroll_bound_sse = unroll_tail_avx & (~3),
         int unroll_tail_last = unroll_tail_avx - unroll_bound_sse>
void simd_copy(float *src, float *dest)
{
    auto src_  = src;
    auto dest_ = dest;

    //Vectorize first part of loop via AVX
    for(; src_!=src+unroll_bound_avx; src_+=8, dest_+=8)
    {
         __m256 buffer = _mm256_load_ps(src_);
         _mm256_store_ps(dest_, buffer);
    }

    //Vectorize remainder part of loop via SSE
    for(; src_!=src+unroll_bound_sse+unroll_bound_avx; src_+=4, dest_+=4)
    {
        __m128 buffer = _mm_load_ps(src_);
        _mm_store_ps(dest_, buffer);
    }

    //Process residual elements
    for(; src_!=src+length; ++src_, ++dest_)
        *dest_ = *src_;
}

int main()
{  
    const int sz = 15;
    float *src = (float *)_mm_malloc(sz*sizeof(float), 16);
    float *dest = (float *)_mm_malloc(sz*sizeof(float), 16);
    float a=0;
    std::generate(src, src+sz, [&](){return ++a;});

    simd_copy<sz>(src, dest);

    _mm_free(src);
    _mm_free(dest);
}

Is it correct to use both SSE and AVX? Do I need to avoid AVX-SSE transitions?

Mysticial · Accepted Answer

You can mix SSE and AVX intrinsics all you want.

The only thing you want to make sure is to specify the correct compiler flag to enable AVX.

GCC: -mavx
Visual Studio: /arch:AVX

Failing to do so will either result in the code not compiling (GCC), or in the case of Visual Studio,
this kind of crap:

Using AVX CPU instructions: Poor performance without "/arch:AVX"

What the flag does is that it forces all SIMD instructions to use VEX encoding to avoid the state-switching penalties described in the question above.

Leeor · Answer

I humbly beg to differ - I would advise to try not to mix SSE and AVX, please read in the link Mystical wrote, it warns against such a mixture (although not stressing it hard enough). The question there is about different code paths for different machines according to AVX support, so there's no mixture - in your case the mix is very fine grained and would be destructive (incure internal delays due to the micro-architectural implementation).

To clarify - Mystical is right about the vex prefix in compilation, without it you'd be in a pretty bad shape as you incure SSE2AVX assists everytime since the upper parts of your YMM registers can't be ignored (unless explicitly using vzeroupper). However, there are more subtle effects even when using 128b AVX mixed with 256b AVX.

I also don't see the benefit of using SSE here, in you have a long loop (say N>100) you could get the benefit from AVX for the most part of it, and do the remainder in scalar code up to 7 iterations (you code may still have to do 3 of them). The performance loss is nothing compared to mixing AVX/SSE

Some more info on mixture - http://software.intel.com/sites/default/files/m/d/4/1/d/8/11MC12_Avoiding_2BAVX-SSE_2BTransition_2BPenalties_2Brh_2Bfinal.pdf

SSE and AVX intrinsics mixture

Tags:

c++

performance

avx

simd

sse

gorill

2 Answers

Mysticial

Leeor

Recent Activity

Donate For Us

SSE and AVX intrinsics mixture

Tags:

c++

performance

avx

simd

sse

gorill

2 Answers

Mysticial

Leeor

Related questions

Recent Activity

Donate For Us