I'm aware of the existing penalty for switching from AVX instructions to SSE instructions without first zeroing out the upper halves of the ymm registers, but in my particular case on my machine (i7-3930K, 3.2 GHz), there seems to be a very large penalty for going the other way around (SSE to AVX), even though I explicitly use _mm256_zeroupper before and after the AVX code section.
I have written functions for converting between 32-bit floats and 32-bit fixed-point integers, operating on two buffers of 32768 elements each. I ported the SSE2 intrinsic version directly to AVX so it processes 8 elements at a time instead of SSE's 4, expecting a significant performance increase, but unfortunately the opposite happened.
So, I have 2 functions:
void ConvertPcm32FloatToPcm32Fixed(int32* outBuffer, const float* inBuffer, uint sampleCount, bool bUseAvx)
{
    const float fScale = (float)(1U<<31);
    if (bUseAvx)
    {
        _mm256_zeroupper();
        const __m256 vScale = _mm256_set1_ps(fScale);
        const __m256 vVolMax = _mm256_set1_ps(fScale-1);
        const __m256 vVolMin = _mm256_set1_ps(-fScale);
        for (uint i = 0; i < sampleCount; i+=8)
        {
            const __m256 vIn0 = _mm256_load_ps(inBuffer+i); // Aligned load
            const __m256 vVal0 = _mm256_mul_ps(vIn0, vScale);
            const __m256 vClamped0 = _mm256_min_ps( _mm256_max_ps(vVal0, vVolMin), vVolMax );
            const __m256i vFinal0 = _mm256_cvtps_epi32(vClamped0);
            _mm256_store_si256((__m256i*)(outBuffer+i), vFinal0); // Aligned store
        }
        _mm256_zeroupper();
    }
    else
    {
        const __m128 vScale = _mm_set1_ps(fScale);
        const __m128 vVolMax = _mm_set1_ps(fScale-1);
        const __m128 vVolMin = _mm_set1_ps(-fScale);
        for (uint i = 0; i < sampleCount; i+=4)
        {
            const __m128 vIn0 = _mm_load_ps(inBuffer+i); // Aligned load
            const __m128 vVal0 = _mm_mul_ps(vIn0, vScale);
            const __m128 vClamped0 = _mm_min_ps( _mm_max_ps(vVal0, vVolMin), vVolMax );
            const __m128i vFinal0 = _mm_cvtps_epi32(vClamped0);
            _mm_store_si128((__m128i*)(outBuffer+i), vFinal0); // Aligned store
        }
    }
}
void ConvertPcm32FixedToPcm32Float(float* outBuffer, const int32* inBuffer, uint sampleCount, bool bUseAvx)
{
    const float fScale = (float)(1U<<31);
    if (bUseAvx)
    {
        _mm256_zeroupper();
        const __m256 vScale = _mm256_set1_ps(1/fScale);
        for (uint i = 0; i < sampleCount; i+=8)
        {
            __m256i vIn0 = _mm256_load_si256(reinterpret_cast<const __m256i*>(inBuffer+i)); // Aligned load
            __m256 vVal0 = _mm256_cvtepi32_ps(vIn0);
            vVal0 = _mm256_mul_ps(vVal0, vScale);
            _mm256_store_ps(outBuffer+i, vVal0); // Aligned store
        }
        _mm256_zeroupper();
    }
    else
    {
        const __m128 vScale = _mm_set1_ps(1/fScale);
        for (uint i = 0; i < sampleCount; i+=4)
        {
            __m128i vIn0 = _mm_load_si128(reinterpret_cast<const __m128i*>(inBuffer+i)); // Aligned load
            __m128 vVal0 = _mm_cvtepi32_ps(vIn0);
            vVal0 = _mm_mul_ps(vVal0, vScale);
            _mm_store_ps(outBuffer+i, vVal0); // Aligned store
        }
    }
}
So I start a timer, run ConvertPcm32FloatToPcm32Fixed then ConvertPcm32FixedToPcm32Float to convert straight back, and stop the timer. The SSE2 versions of the functions execute in a total of 15-16 microseconds, but the AVX versions take 22-23 microseconds.

A bit perplexed, I dug further and found a way to make the AVX versions faster than the SSE2 versions, but it's cheating: I run ConvertPcm32FloatToPcm32Fixed once before starting the timer, then start the timer, run ConvertPcm32FloatToPcm32Fixed again followed by ConvertPcm32FixedToPcm32Float, and stop the timer. As if there were a massive penalty for switching from SSE to AVX, "priming" the AVX version with a trial run drops its execution time to 12 microseconds, whereas doing the same thing with the SSE equivalents only shaves off about a microsecond, down to 14, making AVX the marginal winner here, but only if I cheat.

I considered that maybe AVX doesn't play as nicely with the cache as SSE, but using _mm_prefetch does nothing to help it either.
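Roughly, the measurement looks like the sketch below (TimeMicroseconds is just a stand-in for the actual high-resolution timer I use, and int32/uint64 are typedefs from my codebase; the buffers are 32-byte aligned so the aligned loads/stores are valid):

// Sketch of the measurement; names marked above are placeholders.
alignas(32) static float floatBuf[32768];
alignas(32) static int32 fixedBuf[32768];

// Uncommenting this priming run is the "cheat" that makes the AVX path fast:
// ConvertPcm32FloatToPcm32Fixed(fixedBuf, floatBuf, 32768, true);

const uint64 t0 = TimeMicroseconds();
ConvertPcm32FloatToPcm32Fixed(fixedBuf, floatBuf, 32768, true);  // float -> fixed
ConvertPcm32FixedToPcm32Float(floatBuf, fixedBuf, 32768, true);  // fixed -> float
const uint64 t1 = TimeMicroseconds();
// t1 - t0: ~22-23 us for the AVX path, ~15-16 us for SSE2 (bUseAvx = false),
// unless the priming run above is enabled first.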
Am I missing something here?
I did not test your code, but since your test appears quite short, maybe you're seeing the floating-point warm-up effect that Agner Fog discusses on p. 101 of his microarchitecture manual (this applies to the Sandy Bridge architecture). I quote:
The processor is in a cold state when it has not seen any floating point instructions for a while. The latency for 256-bit vector additions and multiplications is initially two clocks longer than the ideal number, then one clock longer, and after several hundred floating point instructions the processor goes to the warm state where latencies are 3 and 5 clocks respectively. The throughput is half the ideal value for 256-bit vector operations in the cold state. 128-bit vector operations are less affected by this warm-up effect. The latency of 128-bit vector additions and multiplications is at most one clock cycle longer than the ideal value, and the throughput is not reduced in the cold state.
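If that is what's happening, issuing a few hundred dummy 256-bit operations before the timed region should hide it. Just a sketch (the iteration count is arbitrary, and the volatile sink only exists to stop the compiler from deleting the loop):

#include <immintrin.h>

// Run a few hundred 256-bit FP multiplies so the execution units leave the
// "cold" state before the measurement starts.
static void WarmUpAvx()
{
    __m256 v = _mm256_set1_ps(1.0f);
    for (int i = 0; i < 500; ++i)
        v = _mm256_mul_ps(v, v);                // stays at 1.0f, so no overflow
    volatile float sink = _mm256_cvtss_f32(v);  // keep the loop from being optimized away
    (void)sink;
    _mm256_zeroupper();                         // avoid the AVX->SSE transition penalty afterwards
}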
I was under the impression that unless the compiler encodes the SSE instructions using the VEX instruction format (vmulps instead of mulps, as Paul R said), the hit is massive.
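The encoding is purely a compile-time decision: the same 128-bit intrinsic is lowered to either a legacy-SSE or a VEX instruction depending on the build flags. A tiny sketch (assuming GCC/Clang flag syntax; MSVC would use /arch:AVX):

#include <immintrin.h>

// The intrinsic below typically compiles to:
//   mulps  xmm0, xmm1        with  -O2          (legacy SSE encoding)
//   vmulps xmm0, xmm0, xmm1  with  -O2 -mavx    (VEX encoding, no SSE/AVX transition)
__m128 Scale(__m128 v, __m128 s)
{
    return _mm_mul_ps(v, s);  // encoding chosen by the compiler, not by the intrinsic
}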
When optimizing small segments, I tend to use this nice tool from Intel in tandem with some good ol' benchmarks
https://software.intel.com/en-us/articles/intel-architecture-code-analyzer
The report generated by IACA includes this notation:
"@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected"