SSE has been around since 1999, and it and its successor extensions are among the most powerful tools for improving the performance of a C++ program. Yet there are no standardized containers, algorithms, etc. that make explicit use of it (that I am aware of). Is there a reason for this? Was there a proposal that never made it through?
One approach to leveraging vector hardware is SIMD intrinsics, available in all modern C and C++ compilers. SIMD stands for "Single Instruction, Multiple Data". SIMD instructions are available on many platforms; there is a high chance your smartphone has them too, via the ARM NEON architecture extension.
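As a point of comparison, here is a minimal sketch of what raw intrinsics look like on x86 with SSE (the function name add4 is only illustrative; <immintrin.h> and the _mm_* calls are the actual intrinsics):
#include <immintrin.h> // x86 intrinsics; on ARM you would use <arm_neon.h> instead

void add4(float* dst, const float* a, const float* b)
{
    __m128 va = _mm_loadu_ps(a);      // load 4 floats (no alignment required)
    __m128 vb = _mm_loadu_ps(b);
    __m128 vsum = _mm_add_ps(va, vb); // one instruction adds all 4 lanes
    _mm_storeu_ps(dst, vsum);         // store 4 floats back to memory
}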
The GNU Compiler Collection, gcc, offers multiple ways to perform SIMD calculations.
SIMD processing exploits data-level parallelism. Data-level parallelism means that the operations required to transform a set of vector elements can be performed on all elements of the vector at the same time. That is, a single instruction can be applied to multiple data elements in parallel.
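As a concrete illustration of both points, one of GCC's mechanisms is its vector extensions (also supported by Clang). This is a minimal sketch, with the type alias v4f and the function name madd4 chosen only for illustration:
typedef float v4f __attribute__((vector_size(16))); // 4 floats packed in 16 bytes

v4f madd4(v4f a, v4f b, v4f c)
{
    // A single expression; the compiler applies it to all four lanes in parallel.
    return a * b + c;
}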
There is experimental support in the Parallelism TS v2 for explicit short-vector SIMD types that map to the SIMD extensions of common ISAs, but only GCC implements it as of August 2021. The cppreference documentation for it, linked above, is incomplete, but additional details are covered in the Working Draft, Technical Specification for C++ Extensions for Parallelism, Document N4808. The ideas behind the proposal were developed during a PhD project (2015 thesis here). The author of the GCC implementation wrote an article on converting an existing SSE string-processing algorithm to a 2019 iteration of his library, achieving similar performance with much greater readability. Here is some simple code using it, along with the generated assembly:
#include <experimental/simd> // Fails on MSVC 19 and others
using vec4f = std::experimental::fixed_size_simd<float,4>;
void madd(vec4f& out, const vec4f& a, const vec4f& b)
{
    out += a * b;
}
Compiling with -march=znver2 -Ofast -ffast-math, we get a hardware fused multiply-add generated for this:
madd(std::experimental::parallelism_v2::simd<float, std::experimental::parallelism_v2::simd_abi::_Fixed<4> >&, std::experimental::parallelism_v2::simd<float, std::experimental::parallelism_v2::simd_abi::_Fixed<4> > const&, std::experimental::parallelism_v2::simd<float, std::experimental::parallelism_v2::simd_abi::_Fixed<4> > const&):
vmovaps xmm0, XMMWORD PTR [rdx]
vmovaps xmm1, XMMWORD PTR [rdi]
vfmadd132ps xmm0, xmm1, XMMWORD PTR [rsi]
vmovaps XMMWORD PTR [rdi], xmm0
ret
A dot/inner product can be written tersely:
float dot_product(const vec4f a, const vec4f b)
{
    return reduce(a * b);
}
Compiled with the same flags (-Ofast -ffast-math -march=znver2), this produces:
dot_product(std::experimental::parallelism_v2::simd<float, std::experimental::parallelism_v2::simd_abi::_Fixed<4> >, std::experimental::parallelism_v2::simd<float, std::experimental::parallelism_v2::simd_abi::_Fixed<4> >):
vmovaps xmm1, XMMWORD PTR [rsi]
vmulps xmm1, xmm1, XMMWORD PTR [rdi]
vpermilps xmm0, xmm1, 27
vaddps xmm0, xmm0, xmm1
vpermilpd xmm1, xmm0, 3
vaddps xmm0, xmm0, xmm1
ret
(Godbolt link with some more playing around).
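To round it out, here is a hedged usage sketch of the same dot_product (assuming GCC 11 or later, since only GCC ships <experimental/simd>); the array contents are arbitrary example values:
#include <experimental/simd>
#include <cstdio>

namespace stdx = std::experimental;
using vec4f = stdx::fixed_size_simd<float, 4>;

float dot_product(const vec4f a, const vec4f b)
{
    return stdx::reduce(a * b);
}

int main()
{
    float x[4] = {1, 2, 3, 4};
    float y[4] = {5, 6, 7, 8};

    vec4f a, b;
    a.copy_from(x, stdx::element_aligned); // load 4 floats from memory
    b.copy_from(y, stdx::element_aligned);

    std::printf("%f\n", dot_product(a, b)); // prints 70
}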