
Why is there no SIMD functionality in the C++ standard library?

Tags: c++, stl, simd

SSE has been around since 1999, and it and its subsequent extensions are among the most powerful tools for improving the performance of a C++ program. Yet there are no standardized containers, algorithms, etc. that make explicit use of it (that I am aware of?). Is there a reason for this? Was there a proposal that never made it through?

Yamahari asked Dec 17 '19 12:12

People also ask

Does C++ use SIMD?

One approach to leveraging vector hardware is SIMD intrinsics, available in all modern C and C++ compilers. SIMD stands for "Single Instruction, Multiple Data". SIMD instructions are available on many platforms; there is a high chance your smartphone has them too, through the ARM NEON architecture extension.
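As a minimal sketch of what the intrinsics style looks like (SSE on x86 here; the function name mul4 is made up for this example):

#include <xmmintrin.h> // SSE intrinsics; assumes an x86 target

// Multiply four floats element-wise using raw intrinsics.
void mul4(const float* a, const float* b, float* out)
{
    __m128 va = _mm_loadu_ps(a); // load 4 unaligned floats
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_mul_ps(va, vb)); // out[i] = a[i] * b[i]
}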

Does GCC use SIMD?

The GNU Compiler Collection, gcc, offers multiple ways to perform SIMD calculations.
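One of those ways, sketched below, is GCC's vector_size attribute (a GCC/Clang extension, not standard C++):

// A GCC vector extension type: four floats in one 16-byte vector.
typedef float v4sf __attribute__((vector_size(16)));

v4sf mul(v4sf a, v4sf b)
{
    return a * b; // element-wise; compiles to a single (v)mulps on x86
}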

What is SIMD optimization?

SIMD processing exploits data-level parallelism. Data-level parallelism means that the operations required to transform a set of vector elements can be performed on all elements of the vector at the same time. That is, a single instruction can be applied to multiple data elements in parallel.
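For example, a loop like the following exhibits data-level parallelism, and GCC or Clang will typically auto-vectorize it at -O3 (a sketch, not taken from the answer below):

// Each iteration is independent, so the compiler can process
// several elements per SIMD instruction.
void scale(float* x, float k, int n)
{
    for (int i = 0; i < n; ++i)
        x[i] *= k;
}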


1 Answer

There is experimental support in the Parallelism TS v2 for explicit short-vector SIMD types that map to the SIMD extensions of common ISAs, but only GCC implements it as of August 2021. The Cppreference documentation for it linked above is incomplete, but additional details are covered in the Working Draft, Technical Specification for C++ Extensions for Parallelism, Document N4808. The ideas behind this proposal were developed during a PhD project (2015 thesis here). The author of the GCC implementation wrote an article on converting an existing SSE string-processing algorithm to use a 2019 iteration of his library, achieving similar performance and much greater readability. Here's some simple code using it and the generated assembly:

Multiply-add

#include <experimental/simd> // Fails on MSVC 19 and others; GCC implements it

// A fixed-width pack of four floats, regardless of the native SIMD width.
using vec4f = std::experimental::fixed_size_simd<float, 4>;

// Element-wise multiply-add: out[i] += a[i] * b[i] for all lanes.
void madd(vec4f& out, const vec4f& a, const vec4f& b)
{
    out += a * b;
}

Compiling with -march=znver2 -Ofast -ffast-math, we get a hardware fused multiply-add generated for this:

madd(std::experimental::parallelism_v2::simd<float, std::experimental::parallelism_v2::simd_abi::_Fixed<4> >&, std::experimental::parallelism_v2::simd<float, std::experimental::parallelism_v2::simd_abi::_Fixed<4> > const&, std::experimental::parallelism_v2::simd<float, std::experimental::parallelism_v2::simd_abi::_Fixed<4> > const&):
        vmovaps xmm0, XMMWORD PTR [rdx]
        vmovaps xmm1, XMMWORD PTR [rdi]
        vfmadd132ps     xmm0, xmm1, XMMWORD PTR [rsi]
        vmovaps XMMWORD PTR [rdi], xmm0
        ret

Dot Product

A dot/inner product can be written tersely:

float dot_product(const vec4f a, const vec4f b)
{
    // reduce() sums all lanes of the element-wise product.
    return reduce(a * b);
}

Compiled with -Ofast -ffast-math -march=znver2:

dot_product(std::experimental::parallelism_v2::simd<float, std::experimental::parallelism_v2::simd_abi::_Fixed<4> >, std::experimental::parallelism_v2::simd<float, std::experimental::parallelism_v2::simd_abi::_Fixed<4> >):
        vmovaps xmm1, XMMWORD PTR [rsi]
        vmulps  xmm1, xmm1, XMMWORD PTR [rdi]
        vpermilps       xmm0, xmm1, 27
        vaddps  xmm0, xmm0, xmm1
        vpermilpd       xmm1, xmm0, 3
        vaddps  xmm0, xmm0, xmm1
        ret

(Godbolt link with some more playing around).
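Getting data into and out of these types from plain arrays uses copy_from/copy_to with an alignment tag; a minimal sketch against the same GCC implementation (dot4 is a name made up for this example):

#include <experimental/simd>
namespace stdx = std::experimental;
using vec4f = stdx::fixed_size_simd<float, 4>;

float dot4(const float* a, const float* b)
{
    vec4f va, vb;
    va.copy_from(a, stdx::element_aligned); // load 4 floats from memory
    vb.copy_from(b, stdx::element_aligned);
    return stdx::reduce(va * vb);           // horizontal sum of products
}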

ahcox answered Sep 22 '22 23:09