Simple C++ expression templates wrapping intrinsics produces different instructions

Tags:

c++

intrinsics

I am testing out a very simple program that is using C++ expression templates to simplify writing SSE2 and AVX code that operates on arrays of values.

I have a class svec which represents an array of values.

I have a class sreg that represents an SSE2 double register.

I have expr and add_expr representing the addition of svec arrays.

The compiler produces three extra instructions per loop iteration for my expression template test case compared to the hand rolled code. I was wondering if there is a reason for this, or any changes I can make to get the compiler to produce the same output?

The full test harness is:

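#include <cstddef>      // std::size_t
#include <cstdio>       // std::getchar
#include <cstdlib>      // EXIT_SUCCESS
#include <iostream>
#include <malloc.h>     // _aligned_malloc, _aligned_free (MSVC)
#include <emmintrin.h>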

struct sreg
{
    __m128d reg_;

    sreg() {}

    sreg(const __m128d& r) :
        reg_(r)
    {
    }

    sreg operator+(const sreg& b) const
    {
        return _mm_add_pd(reg_, b.reg_);
    }
};

template <typename T>
struct expr
{
    sreg operator[](std::size_t i) const
    {
        return static_cast<const T&>(*this).operator[](i);
    }

    operator const T&() const
    {
        return static_cast<const T&>(*this);
    }
};

template <typename A, typename B>
struct add_expr : public expr<add_expr<A, B>>
{
    const A& a_;
    const B& b_;

    add_expr(const A& a, const B& b) :
        a_{ a }, b_{ b }
    {
    }

    sreg operator[](std::size_t i) const
    {
        return a_[i] + b_[i];
    }
};

template <typename A, typename B>
inline auto operator+(const expr<A>& a, const expr<B>& b)
{
    return add_expr<A, B>(a, b);
}

struct svec : public expr<svec>
{
    sreg* regs_;
    std::size_t size_;

    svec(std::size_t size) :
        size_{ size }
    {
        regs_ = static_cast<sreg*>(_aligned_malloc(size * 32, 32));
    }

    ~svec()
    {
        _aligned_free(regs_);
    }

    template <typename T>
    svec& operator=(const T& expression)
    {
        for (std::size_t i = 0; i < size(); i++)
        {
            regs_[i] = expression[i];
        }

        return *this;
    }

    const sreg& operator[](std::size_t index) const
    {
        return regs_[index];
    }

    sreg& operator[](std::size_t index)
    {
        return regs_[index];
    }

    std::size_t size() const
    {
        return size_;
    }
};

static constexpr std::size_t size = 64;

int main()
{
    svec a(size);
    svec b(size);
    svec c(size);
    svec d(size);
    svec vec(size);

    //hand rolled loop
    for (std::size_t j = 0; j < size; j++)
    {
        vec[j] = a[j] + b[j] + c[j] + d[j];
    }

    //expression templates version of hand rolled loop
    vec = a + b + c + d;

    std::cout << "Done...";

    std::getchar();

    return EXIT_SUCCESS;
}

For the hand rolled loop the instructions are:

00007FF621CD1B70  mov         r8,qword ptr [c]  
00007FF621CD1B75  mov         rdx,qword ptr [b]  
00007FF621CD1B7A  mov         rax,qword ptr [a]  
00007FF621CD1B7F  vmovupd     xmm0,xmmword ptr [rcx+rax]  
00007FF621CD1B84  vaddpd      xmm1,xmm0,xmmword ptr [rdx+rcx]  
00007FF621CD1B89  vaddpd      xmm3,xmm1,xmmword ptr [r8+rcx]  
00007FF621CD1B8F  lea         rax,[rcx+rbx]  
00007FF621CD1B93  vaddpd      xmm1,xmm3,xmmword ptr [r10+rax]  
00007FF621CD1B99  vmovupd     xmmword ptr [rax],xmm1  
00007FF621CD1B9D  add         rcx,10h  
00007FF621CD1BA1  cmp         rcx,400h  
00007FF621CD1BA8  jb          main+0C0h (07FF621CD1B70h)

For the expression templates version:

00007FF621CD1BC0  mov         rdx,qword ptr [c]  
00007FF621CD1BC5  mov         rcx,qword ptr [rcx]  
00007FF621CD1BC8  mov         rax,qword ptr [r8]  
00007FF621CD1BCB  vmovupd     xmm0,xmmword ptr [r9+rax]  
00007FF621CD1BD1  vaddpd      xmm1,xmm0,xmmword ptr [rcx+r9]  
00007FF621CD1BD7  vaddpd      xmm0,xmm1,xmmword ptr [rdx+r9]  
00007FF621CD1BDD  lea         rax,[r9+rbx]  
00007FF621CD1BE1  vaddpd      xmm0,xmm0,xmmword ptr [rax+r10]  
00007FF621CD1BE7  vmovupd     xmmword ptr [rax],xmm0  
00007FF621CD1BEB  add         r9,10h  
00007FF621CD1BEF  cmp         r9,400h  
00007FF621CD1BF6  jae         main+154h (07FF621CD1C04h)  # extra instruction 1
00007FF621CD1BF8  mov         rcx,qword ptr [rsp+60h]     # extra instruction 2
00007FF621CD1BFD  mov         r8,qword ptr [rsp+58h]      # extra instruction 3
00007FF621CD1C02  jmp         main+110h (07FF621CD1BC0h)

Please note this is minimum verifiable code to specifically demonstrate a problem. The code was compiled using the default Release build settings in Visual Studio 2015 Update 3.

Ideas I have discounted:

- The order of the loops: I have already swapped the hand rolled loop and the expression templates loop, and the compiler still inserts the extra instructions.
- The compiler optimising the hand rolled loop based on the constexpr size: I have already tried test code that prevents the compiler from deducing that size is constant, and it makes no difference to the hand rolled loop's instructions.

asked Dec 01 '16 10:12

keith

1 Answer

Both loops seem to be reloading the array pointers every iteration (e.g. mov r8, [c] in the first loop). The second version is just doing it even more inefficiently, with two levels of indirection, one of which comes at the end of the loop, after a conditional branch that breaks out of the loop.

Note that one of the changed instructions which you didn't identify as "new" is mov rcx, [rcx]. Register allocation is different between the loops, but those are the array start pointers. That instruction, together with the mov rcx,[rsp+60h] after the store, replaces mov rax,qword ptr [a]. I assume a is also an offset from RSP, and not actually a label for static storage.

Presumably this is happening because MSVC++ didn't succeed at alias analysis to prove that the stores into vec[j] can't modify any of the pointers. I didn't look carefully at your templates, but if you're introducing an extra level of indirection that you'd expect to optimize away, the problem is that it isn't.
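To make the aliasing concern concrete, here is a minimal sketch rather than the code from the question. The names vec_view, add_store_reloading and add_store_hoisted are hypothetical, and plain double stands in for sreg so the example does not depend on SSE2. The first function reads the pointers through the structs inside the loop; the second copies them into locals first, which is one way to take the question of aliasing out of the compiler's hands. Whether MSVC 2015 actually emits a tighter loop for the second form is an assumption you would have to check in the disassembly.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for svec: a pointer plus a length, no ownership.
struct vec_view
{
    double* data;
    std::size_t size;
};

// The pointers are read through the structs on every iteration. If the
// compiler cannot prove that the store to out.data[i] leaves a.data, b.data
// and out.data unchanged, it may reload them each time round the loop.
inline void add_store_reloading(vec_view& out, const vec_view& a, const vec_view& b)
{
    for (std::size_t i = 0; i < out.size; i++)
    {
        out.data[i] = a.data[i] + b.data[i];
    }
}

// The pointers and the length are copied into locals before the loop. The
// locals never have their address taken, so a store through dst cannot
// modify them and they can stay in registers.
inline void add_store_hoisted(vec_view& out, const vec_view& a, const vec_view& b)
{
    assert(a.size >= out.size && b.size >= out.size);

    double* dst = out.data;
    const double* pa = a.data;
    const double* pb = b.data;
    const std::size_t n = out.size;

    for (std::size_t i = 0; i < n; i++)
    {
        dst[i] = pa[i] + pb[i];
    }
}
```

Both functions compute the same result; only the amount the optimizer has to prove differs.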

The obvious solution is to use a compiler that optimizes better. clang 3.9 does well (auto-vectorizing with no reloads of pointers), and gcc optimizes the loop away completely because the result is not used.

But if you're stuck with MSVC, see if there are any strict-aliasing options, or no-aliasing keywords or declarations, that would be helpful. For example, GNU C++ extensions include __restrict__ to get the same "this doesn't alias" behaviour as C99's restrict keyword. I don't know whether MSVC supports anything like that.
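As a rough sketch of what a no-alias annotation could look like: GCC and clang spell the extension __restrict__, and MSVC documents a __restrict keyword with a similar meaning. The NO_ALIAS macro and the add4 function below are hypothetical names, plain double again replaces sreg, and I have not tested whether this removes the reloads from the templated code in the question.

```cpp
#include <cassert>
#include <cstddef>

// NO_ALIAS is a hypothetical helper macro, not part of any library.
#if defined(_MSC_VER)
#define NO_ALIAS __restrict      // MSVC spelling
#else
#define NO_ALIAS __restrict__    // GNU C++ spelling (GCC, clang)
#endif

// NO_ALIAS on out promises the compiler that the elements written through
// out are not accessed through a, b, c or d inside this function. Calling
// add4 with an output array that overlaps an input is undefined behaviour.
inline void add4(double* NO_ALIAS out,
                 const double* NO_ALIAS a,
                 const double* NO_ALIAS b,
                 const double* NO_ALIAS c,
                 const double* NO_ALIAS d,
                 std::size_t n)
{
    // Cheap sanity check: catches identical pointers, not partial overlap.
    assert(out != a && out != b && out != c && out != d);

    for (std::size_t i = 0; i < n; i++)
    {
        out[i] = a[i] + b[i] + c[i] + d[i];
    }
}
```

The annotation is a promise from the programmer, not something the compiler verifies, so it is only safe where the arrays really are distinct, as the five separately allocated svec objects in the question are.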

Nit-pick:

It's not quite right to call jae an "extra" instruction. It's just the opposite predicate from jb, so now it's a while(true){ ... if() break; reload; } loop instead of a more-efficient do{...}while() loop. (I'm using C syntax to show the asm loop structure. Obviously if you actually compiled those C loops, the compiler could optimize them.) So if anything, the "extra instruction" is the unconditional branch, JMP.
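Spelled out as source, the two loop shapes look roughly like this. It is only an illustration of the asm structure: sum_do_while and sum_break_then_reload are hypothetical names, and a pointer-to-pointer parameter stands in for the stack slots that the second loop reloads from.

```cpp
#include <cassert>
#include <cstddef>

// Shape of the hand rolled loop's asm: a single conditional branch at the
// bottom. Like the asm, it assumes at least one iteration.
inline double sum_do_while(const double* p, std::size_t n)
{
    assert(n > 0);
    double total = 0.0;
    std::size_t i = 0;
    do
    {
        total += p[i];
        i++;
    } while (i < n);            // jb: branch back while i < n
    return total;
}

// Shape of the expression template loop's asm: a conditional exit in the
// middle, a reload after it, then an unconditional jump back to the top.
inline double sum_break_then_reload(const double* const* slot, std::size_t n)
{
    assert(n > 0);
    double total = 0.0;
    std::size_t i = 0;
    const double* p = *slot;    // first load, before entering the loop
    while (true)
    {
        total += p[i];
        i++;
        if (i >= n) break;      // jae: leave the loop
        p = *slot;              // reload placed after the exit check
    }                           // jmp: back to the top
    return total;
}
```

Both functions return the same sum; the second just needs one more branch instruction per iteration when translated literally.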

answered Oct 27 '22 01:10

Peter Cordes
