Tiny SSE addpd loop slightly slower than scalar on AMD Phenom II?

Question

Yes, I have read "SIMD code runs slower than scalar code". No, this is not really a duplicate.

I have been using 2D math stuff for a while, and I am in the process of porting my codebase from C to C++. There are a few walls I've hit with C that mean I really need polymorphism, but that's another story. Anyway, I had considered this a while ago, and the port presented a perfect opportunity to write a 2D vector class, including SSE implementations of the common math operations. Yes, I know there are libraries out there, but I wanted to try it myself to understand what's going on, and I don't use anything more complicated than +=.

My implementation is via <immintrin.h>, with a

union {
    __m128d ss;       // both doubles viewed as one SSE value
    struct {          // anonymous struct: per-component access
        double x;
        double y;
    };
};

SSE seemed slow, so I looked at its generated ASM output. After fixing something stupid pointerwise, I ended up with the following sets of instructions, run a billion times in a loop: (Processor is an AMD Phenom II at 3.7GHz)

SSE enabled: 1.1 to 1.8 seconds (varies)

add      $0x1, %eax
addpd    %xmm0, %xmm1
cmp      $0x3b9aca00, %eax
jne      4006c8

SSE disabled: 1.0 seconds (pretty constant)

add      $0x1, %eax
addsd    %xmm0, %xmm3
cmp      $0x3b9aca00, %eax
addsd    %xmm2, %xmm1
jne      400630

The only conclusion I can draw from this is that addsd is faster than addpd, and that pipelining compensates for the extra instruction because the two faster adds can partially overlap.

So my question is: is this worth it, and in practice will it actually help, or should I just not bother with the stupid optimization and let the compiler handle it in scalar mode?

Joel Falcou · Accepted Answer

This requires more loop unrolling and maybe cache prefetching. Your arithmetic density is very low: 1 operation for 2 memory operations, so you need to jam as many of these into your pipeline as possible.
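To make the unrolling advice concrete, here is a minimal sketch, not the asker's actual code. It assumes the points live in a flat array laid out as x0, y0, x1, y1, and so on; the function name sum_points_unrolled and the choice of four accumulators are illustrative assumptions. Each accumulator forms its own dependency chain, so one addpd does not have to wait for the result of the previous one.

```cpp
#include <cstddef>
#include <emmintrin.h>  // SSE2: __m128d, _mm_add_pd, _mm_loadu_pd, _mm_storeu_pd

// Hypothetical helper: sums n 2D points stored as x0,y0,x1,y1,...
// and writes the total to out[0] (x) and out[1] (y).
inline void sum_points_unrolled(const double* xy, std::size_t n, double* out) {
    // Four independent accumulators, so consecutive adds do not depend
    // on each other and can overlap in the pipeline.
    __m128d acc0 = _mm_setzero_pd();
    __m128d acc1 = _mm_setzero_pd();
    __m128d acc2 = _mm_setzero_pd();
    __m128d acc3 = _mm_setzero_pd();

    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        // Unaligned loads: no alignment assumption is made about xy.
        acc0 = _mm_add_pd(acc0, _mm_loadu_pd(xy + 2 * i));
        acc1 = _mm_add_pd(acc1, _mm_loadu_pd(xy + 2 * i + 2));
        acc2 = _mm_add_pd(acc2, _mm_loadu_pd(xy + 2 * i + 4));
        acc3 = _mm_add_pd(acc3, _mm_loadu_pd(xy + 2 * i + 6));
    }
    // Leftover points (fewer than four) go into the first accumulator.
    for (; i < n; ++i) {
        acc0 = _mm_add_pd(acc0, _mm_loadu_pd(xy + 2 * i));
    }

    // Combine the partial sums once, outside the hot loop.
    __m128d total = _mm_add_pd(_mm_add_pd(acc0, acc1), _mm_add_pd(acc2, acc3));
    _mm_storeu_pd(out, total);
}
```

How many accumulators pay off depends on the add latency and throughput of the CPU in question, so the factor of four is something to measure rather than trust. Note also that splitting the sum changes the order of the floating-point additions, which can change the rounding of the result slightly.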

Also, don't use a union; use __m128d directly and use _mm_load_pd to fill your __m128d from your data. Putting a __m128d in a union generates bad code in which every element does a stack-register-stack dance, which is detrimental.
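A rough sketch of what "__m128d directly" could look like for a 2D vector class follows. The name Vec2 and its member functions are hypothetical, not taken from the question. The sketch uses _mm_loadu_pd so that it works with any pointer; _mm_load_pd, as suggested above, is the aligned variant and requires the address to be 16-byte aligned.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical 2D vector that stores a __m128d directly (no union).
struct Vec2 {
    __m128d v;

    // _mm_set_pd takes the high element first, so y comes before x.
    Vec2(double x, double y) : v(_mm_set_pd(y, x)) {}
    explicit Vec2(__m128d raw) : v(raw) {}

    // Load x and y from two consecutive doubles (unaligned load).
    static Vec2 load(const double* p) { return Vec2(_mm_loadu_pd(p)); }

    // Write x and y back to two consecutive doubles.
    void store(double* p) const { _mm_storeu_pd(p, v); }

    // One addpd covers both components.
    Vec2& operator+=(const Vec2& other) {
        v = _mm_add_pd(v, other.v);
        return *this;
    }

    // Component access through intrinsics instead of union members.
    double x() const { return _mm_cvtsd_f64(v); }
    double y() const { return _mm_cvtsd_f64(_mm_unpackhi_pd(v, v)); }
};
```

The idea is to keep values in __m128d form for the whole computation and call x(), y(), or store() only at the edges, where the scalar values are actually needed.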

Tags: c++, c, gcc, assembly, sse

zebediah49
