This is a follow-up to SSE-copy, AVX-copy and std::copy performance. Suppose we need to vectorize a loop in the following manner: 1) vectorize the first batch of the loop (whose size is a multiple of 8) via AVX; 2) split the loop's remainder into two batches and vectorize the batch that is a multiple of 4 via SSE; 3) process the residual elements of the loop with a serial routine. Let's consider this example of copying arrays:
#include <immintrin.h>
#include <algorithm>   // std::generate

template<int length,
         int unroll_bound_avx  = length & (~7),
         int unroll_tail_avx   = length - unroll_bound_avx,
         int unroll_bound_sse  = unroll_tail_avx & (~3),
         int unroll_tail_last  = unroll_tail_avx - unroll_bound_sse>
void simd_copy(float *src, float *dest)
{
    auto src_  = src;
    auto dest_ = dest;

    // Vectorize the first part of the loop (multiple of 8) via AVX
    for(; src_ != src + unroll_bound_avx; src_ += 8, dest_ += 8)
    {
        __m256 buffer = _mm256_load_ps(src_);
        _mm256_store_ps(dest_, buffer);
    }

    // Vectorize the remainder that is a multiple of 4 via SSE
    for(; src_ != src + unroll_bound_avx + unroll_bound_sse; src_ += 4, dest_ += 4)
    {
        __m128 buffer = _mm_load_ps(src_);
        _mm_store_ps(dest_, buffer);
    }

    // Process the residual elements serially
    for(; src_ != src + length; ++src_, ++dest_)
        *dest_ = *src_;
}

int main()
{
    const int sz = 15;
    // 32-byte alignment is required for _mm256_load_ps/_mm256_store_ps
    float *src  = (float *)_mm_malloc(sz * sizeof(float), 32);
    float *dest = (float *)_mm_malloc(sz * sizeof(float), 32);
    float a = 0;
    std::generate(src, src + sz, [&](){ return ++a; });

    simd_copy<sz>(src, dest);

    _mm_free(src);
    _mm_free(dest);
}
Is it correct to use both SSE and AVX? Do I need to avoid AVX-SSE transitions?
You can mix SSE and AVX intrinsics all you want.
The only thing you need to make sure of is that you specify the correct compiler flag to enable AVX:
-mavx (GCC/Clang)
/arch:AVX (Visual Studio)
Failing to do so will either result in the code not compiling (GCC) or, in the case of Visual Studio, code that compiles but mixes legacy (non-VEX) SSE instructions with VEX-encoded AVX - exactly the kind of transition you are trying to avoid.
What the flag does is force all SIMD instructions to use VEX encoding, which avoids the state-switching penalties described in the question above.
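As a quick way to catch a missing flag at build time (my own sketch, not part of the original answer): both GCC/Clang with -mavx and Visual Studio with /arch:AVX define the __AVX__ macro, so a guard like this turns a forgotten flag into a compile error:

// Sketch: abort the build if AVX code generation was not enabled.
// GCC/Clang define __AVX__ under -mavx; MSVC defines it under /arch:AVX.
#ifndef __AVX__
#error "AVX not enabled: compile with -mavx (GCC/Clang) or /arch:AVX (MSVC)"
#endif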
I humbly beg to differ - I would advise trying not to mix SSE and AVX. Please read the link Mysticial posted; it warns against such a mixture (although it does not stress it hard enough). The question there is about different code paths for different machines according to AVX support, so there is no mixing there - in your case the mix is very fine-grained and would be destructive (incurring internal delays due to the micro-architectural implementation).
To clarify - Mysticial is right about the VEX prefix in compilation; without it you would be in pretty bad shape, as you would incur SSE-to-AVX assists every time, since the upper parts of your YMM registers can't be ignored (unless you explicitly use vzeroupper). However, there are more subtle effects even when mixing 128-bit AVX with 256-bit AVX.
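To illustrate where an explicit vzeroupper goes (my own sketch, not part of the original answers; the function names are hypothetical): it matters when 256-bit AVX code hands control to legacy, non-VEX SSE code, such as a library built without AVX - compilers building with -mavx normally insert it at function boundaries for you.

#include <immintrin.h>

// Stand-in for a routine that, in a real project, would live in a translation
// unit compiled WITHOUT AVX, so its SSE intrinsics get legacy (non-VEX) encodings.
void sse_tail(const float *src, float *dst)
{
    __m128 w = _mm_loadu_ps(src);
    _mm_storeu_ps(dst, w);
}

void avx_then_sse(const float *src, float *dst)
{
    __m256 v = _mm256_loadu_ps(src);   // 256-bit AVX work (VEX-encoded)
    _mm256_storeu_ps(dst, v);

    // Clear the upper 128 bits of all YMM registers before executing
    // legacy SSE code; without this, the CPU saves/restores the upper
    // state and you pay the transition penalty.
    _mm256_zeroupper();

    sse_tail(src + 8, dst + 8);        // 128-bit SSE work
}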
I also don't see the benefit of using SSE here. If you have a long loop (say N > 100), you get the benefit of AVX for most of it and can do the remainder in scalar code - at most 7 iterations (your code may still have to do 3 of them anyway). The performance loss is nothing compared to mixing AVX/SSE.
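To make that concrete, here is a rough sketch of the simpler shape being suggested - one 256-bit AVX main loop plus a scalar tail of at most 7 elements, with no SSE stage (my own illustration, not code from the question; avx_copy is a hypothetical name):

#include <immintrin.h>

template<int length>
void avx_copy(const float *src, float *dest)
{
    constexpr int avx_bound = length & ~7;   // largest multiple of 8 <= length

    int i = 0;
    // Main loop: copy 8 floats per iteration with 256-bit AVX.
    for (; i < avx_bound; i += 8)
        _mm256_storeu_ps(dest + i, _mm256_loadu_ps(src + i));

    // Scalar tail: at most 7 remaining elements.
    for (; i < length; ++i)
        dest[i] = src[i];
}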
Some more info on the mixing penalties - http://software.intel.com/sites/default/files/m/d/4/1/d/8/11MC12_Avoiding_2BAVX-SSE_2BTransition_2BPenalties_2Brh_2Bfinal.pdf