I have a large piece of code, part of whose body contains this piece of code:
result = (nx * m_Lx + ny * m_Ly + m_Lz) / sqrt(nx * nx + ny * ny + 1);
which I have vectorized as follows (everything is already a float):
__m128 r = _mm_mul_ps(_mm_set_ps(ny, nx, ny, nx),
_mm_set_ps(ny, nx, m_Ly, m_Lx));
__declspec(align(16)) int asInt[4] = {
_mm_extract_ps(r,0), _mm_extract_ps(r,1),
_mm_extract_ps(r,2), _mm_extract_ps(r,3)
};
float (&res)[4] = reinterpret_cast<float (&)[4]>(asInt);
result = (res[0] + res[1] + m_Lz) / sqrt(res[2] + res[3] + 1);
The result is correct; however, my benchmarking shows that the vectorized version is slower. Setting result to 0 directly (and removing this part of the code entirely) reduces the entire process to 2500 ms.
Given that the vectorized version only contains one set of SSE multiplications (instead of four individual FPU multiplications), why is it slower? Is the FPU indeed faster than SSE, or is there a confounding variable here?
(I'm on a mobile Core i5.)
You are spending a lot of time moving scalar values to/from SSE registers with _mm_set_ps and _mm_extract_ps - this is generating a lot of instructions, the execution time of which will far outweigh any benefit from using _mm_mul_ps. Take a look at the generated assembly output to see how much code is being generated in addition to the single MULPS instruction.
To vectorize this properly you need to use 128 bit SSE loads and stores (_mm_load_ps/_mm_store_ps) and then use SSE shuffle instructions to move elements around within registers where needed.
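As a rough illustration of the load/store approach, here is a minimal sketch that evaluates the expression for four elements per iteration instead of trying to vectorize a single evaluation. It assumes the nx and ny values can be laid out in separate contiguous float arrays, which is a guess about the surrounding code; the function name shade_batch and its parameters are hypothetical. With that layout no shuffles are needed at all. It uses the unaligned _mm_loadu_ps/_mm_storeu_ps so it works on any pointer; if the arrays are known to be 16-byte aligned, _mm_load_ps/_mm_store_ps can be substituted.

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical helper (not from the original code): computes
//   out[i] = (nx[i]*Lx + ny[i]*Ly + Lz) / sqrt(nx[i]*nx[i] + ny[i]*ny[i] + 1)
// for n elements, four at a time, with a scalar loop for the remainder.
void shade_batch(const float* nx, const float* ny, float* out, std::size_t n,
                 float Lx, float Ly, float Lz)
{
    assert(nx != nullptr && ny != nullptr && out != nullptr);

    // Broadcast the constants once, outside the loop.
    const __m128 vLx = _mm_set1_ps(Lx);
    const __m128 vLy = _mm_set1_ps(Ly);
    const __m128 vLz = _mm_set1_ps(Lz);
    const __m128 one = _mm_set1_ps(1.0f);

    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        // One 128-bit load per input array; no scalar-to-vector moves.
        const __m128 x = _mm_loadu_ps(nx + i);
        const __m128 y = _mm_loadu_ps(ny + i);

        // Numerator: nx*Lx + ny*Ly + Lz
        const __m128 num = _mm_add_ps(
            _mm_add_ps(_mm_mul_ps(x, vLx), _mm_mul_ps(y, vLy)), vLz);

        // Denominator: sqrt(nx*nx + ny*ny + 1)
        const __m128 den = _mm_sqrt_ps(
            _mm_add_ps(_mm_add_ps(_mm_mul_ps(x, x), _mm_mul_ps(y, y)), one));

        // One 128-bit store for four results.
        _mm_storeu_ps(out + i, _mm_div_ps(num, den));
    }

    // Scalar tail for the last n % 4 elements.
    for (; i < n; ++i) {
        out[i] = (nx[i] * Lx + ny[i] * Ly + Lz)
               / std::sqrt(nx[i] * nx[i] + ny[i] * ny[i] + 1.0f);
    }
}
```

A call would look like shade_batch(nxArray, nyArray, results, count, m_Lx, m_Ly, m_Lz). Whether this is actually faster in your program depends on whether the data can be arranged this way, so it is worth benchmarking and checking the generated assembly.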
One further point to note - modern CPUs such as Core i5, Core i7, have two scalar FPUs and can issue 2 floating point multiplies per clock. The potential benefit from SSE for single precision floating point is therefore only 2x at best. It's easy to lose most/all of this 2x benefit if you have excessive "housekeeping" instructions, as is the case here.
There are several problems: