SSE has been around since 1999, and it and its successor extensions are among the most powerful tools for improving the performance of a C++ program. Yet there are no standardized containers, algorithms, etc. that make explicit use of it (that I am aware of). Is there a reason for this? Was there a proposal that never made it through?
One approach to leveraging vector hardware is SIMD intrinsics, available in all modern C and C++ compilers. SIMD stands for "Single Instruction, Multiple Data". SIMD instructions are available on many platforms; there is a high chance your smartphone has them too, via the ARM NEON architecture extension.
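As a point of comparison, here is a minimal sketch of what raw intrinsics look like on x86 with SSE (the function name add4 is only illustrative; <immintrin.h> and the _mm_* calls are the actual intrinsics):
#include <immintrin.h> // x86 intrinsics; on ARM you would use <arm_neon.h> instead

void add4(float* dst, const float* a, const float* b)
{
    __m128 va = _mm_loadu_ps(a);      // load 4 floats (no alignment required)
    __m128 vb = _mm_loadu_ps(b);
    __m128 vsum = _mm_add_ps(va, vb); // one instruction adds all 4 lanes
    _mm_storeu_ps(dst, vsum);         // store 4 floats back to memory
}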
The GNU Compiler Collection, gcc, offers multiple ways to perform SIMD calculations.
SIMD processing exploits data-level parallelism. Data-level parallelism means that the operations required to transform a set of vector elements can be performed on all elements of the vector at the same time. That is, a single instruction can be applied to multiple data elements in parallel.
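As a concrete illustration of both points, one of GCC's mechanisms is its vector extensions (also supported by Clang). This is a minimal sketch, with the type alias v4f and the function name madd4 chosen only for illustration:
typedef float v4f __attribute__((vector_size(16))); // 4 floats packed in 16 bytes

v4f madd4(v4f a, v4f b, v4f c)
{
    // A single expression; the compiler applies it to all four lanes in parallel.
    return a * b + c;
}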
There is experimental support in the Parallelism TS v2 for explicit short-vector SIMD types that map to the SIMD extensions of common ISAs, but only GCC implements it as of August 2021. The cppreference documentation for it, linked above, is incomplete, but additional details are covered in the Working Draft, Technical Specification for C++ Extensions for Parallelism, Document N4808. The ideas behind the proposal were developed during a PhD project (2015 thesis here). The author of the GCC implementation wrote an article on converting an existing SSE string-processing algorithm to a 2019 iteration of his library, achieving similar performance with much greater readability. Here is some simple code using it, along with the generated assembly:
#include <experimental/simd> // Fails on MSVC 19 and others
using vec4f = std::experimental::fixed_size_simd<float,4>;
void madd(vec4f& out, const vec4f& a, const vec4f& b)
{
    out += a * b;
}
Compiling with -march=znver2 -Ofast -ffast-math, we get a hardware fused multiply-add generated for this:
madd(std::experimental::parallelism_v2::simd<float, std::experimental::parallelism_v2::simd_abi::_Fixed<4> >&, std::experimental::parallelism_v2::simd<float, std::experimental::parallelism_v2::simd_abi::_Fixed<4> > const&, std::experimental::parallelism_v2::simd<float, std::experimental::parallelism_v2::simd_abi::_Fixed<4> > const&):
vmovaps xmm0, XMMWORD PTR [rdx]
vmovaps xmm1, XMMWORD PTR [rdi]
vfmadd132ps xmm0, xmm1, XMMWORD PTR [rsi]
vmovaps XMMWORD PTR [rdi], xmm0
ret
A dot/inner product can be written tersely:
float dot_product(const vec4f a, const vec4f b)
{
    return reduce(a * b);
}
Compiled with the same flags (-Ofast -ffast-math -march=znver2), this produces:
dot_product(std::experimental::parallelism_v2::simd<float, std::experimental::parallelism_v2::simd_abi::_Fixed<4> >, std::experimental::parallelism_v2::simd<float, std::experimental::parallelism_v2::simd_abi::_Fixed<4> >):
vmovaps xmm1, XMMWORD PTR [rsi]
vmulps xmm1, xmm1, XMMWORD PTR [rdi]
vpermilps xmm0, xmm1, 27
vaddps xmm0, xmm0, xmm1
vpermilpd xmm1, xmm0, 3
vaddps xmm0, xmm0, xmm1
ret
(Godbolt link with some more playing around).
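To round it out, here is a hedged usage sketch of the same dot_product (assuming GCC 11 or later, since only GCC ships <experimental/simd>); the array contents are arbitrary example values:
#include <experimental/simd>
#include <cstdio>

namespace stdx = std::experimental;
using vec4f = stdx::fixed_size_simd<float, 4>;

float dot_product(const vec4f a, const vec4f b)
{
    return stdx::reduce(a * b);
}

int main()
{
    float x[4] = {1, 2, 3, 4};
    float y[4] = {5, 6, 7, 8};

    vec4f a, b;
    a.copy_from(x, stdx::element_aligned); // load 4 floats from memory
    b.copy_from(y, stdx::element_aligned);

    std::printf("%f\n", dot_product(a, b)); // prints 70
}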