I am currently writing a C++ template expression library and comparing some instantiations with hand-written code at assembly level. The hand-written function is the following:
spinor multiply(vector const& a, vector const& b)
{
    spinor result = {
        a.at<1>() * b.at<1>() - a.at<2>() * b.at<2>()
            - a.at<4>() * b.at<4>() - a.at<8>() * b.at<8>(),
        a.at<1>() * b.at<2>() - a.at<2>() * b.at<1>(),
        a.at<1>() * b.at<4>() - a.at<4>() * b.at<1>(),
        a.at<1>() * b.at<8>() - a.at<8>() * b.at<1>(),
        a.at<2>() * b.at<4>() - a.at<4>() * b.at<2>(),
        a.at<2>() * b.at<8>() - a.at<8>() * b.at<2>(),
        a.at<4>() * b.at<8>() - a.at<8>() * b.at<4>()
    };
    return result;
}
The vector class is just a wrapper over four doubles which can be read using the at<index>() member function. Because of design decisions, the indices of the four components are 1, 2, 4, 8 instead of the usual 0, 1, 2, 3.
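The actual vector class is not shown in the question; a minimal sketch of such a wrapper (the storage layout and the slot mapping are assumptions for illustration) might look like this:

```cpp
#include <cassert>

// Hypothetical sketch of the wrapper described above: four doubles,
// addressed with the basis indices 1, 2, 4, 8 instead of 0..3.
class vector
{
    double data[4];

    // Map the power-of-two index onto a storage slot: 1->0, 2->1, 4->2, 8->3.
    template <int Index> static constexpr int slot()
    {
        static_assert(Index == 1 || Index == 2 || Index == 4 || Index == 8,
                      "index must be 1, 2, 4 or 8");
        return Index == 1 ? 0 : Index == 2 ? 1 : Index == 4 ? 2 : 3;
    }

public:
    vector(double t, double x, double y, double z) : data{t, x, y, z} {}

    template <int Index> double at() const { return data[slot<Index>()]; }
};
```

Since at<index>() resolves the slot at compile time, reads like a.at<4>() compile down to a plain load at a fixed offset, which matches the constant offsets (0x10(%rsi) etc.) in the disassembly below.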
The purpose of this function is to return the result of a multiplication of two vectors (in Minkowski space). If you are familiar with Geometric Algebra, you will recognize the dot product (first component of result, symmetric under exchange of a and b) and the wedge product (the remaining components, antisymmetric under exchange of a and b). If you are not familiar with Geometric Algebra, just take this function as a prescription for multiplying vectors.
If I compile the function above with GCC 4.7 and look at the disassembly given by objdump -SC a.out, I get the following output:
400bc0: movsd 0x8(%rsi),%xmm6
400bc5: mov %rdi,%rax
400bc8: movsd (%rsi),%xmm8
400bcd: movsd 0x8(%rdx),%xmm5
400bd2: movapd %xmm6,%xmm9
400bd7: movsd (%rdx),%xmm7
400bdb: movapd %xmm8,%xmm0
400be0: mulsd %xmm5,%xmm9
400be5: movsd 0x10(%rsi),%xmm4
400bea: mulsd %xmm7,%xmm0
400bee: movsd 0x10(%rdx),%xmm1
400bf3: movsd 0x18(%rdx),%xmm3
400bf8: movsd 0x18(%rsi),%xmm2
400bfd: subsd %xmm9,%xmm0
400c02: movapd %xmm4,%xmm9
400c07: mulsd %xmm1,%xmm9
400c0c: subsd %xmm9,%xmm0
400c11: movapd %xmm3,%xmm9
400c16: mulsd %xmm2,%xmm9
400c1b: subsd %xmm9,%xmm0
400c20: movapd %xmm6,%xmm9
400c25: mulsd %xmm7,%xmm9
400c2a: movsd %xmm0,(%rdi)
400c2e: movapd %xmm5,%xmm0
400c32: mulsd %xmm8,%xmm0
400c37: subsd %xmm9,%xmm0
400c3c: movapd %xmm4,%xmm9
400c41: mulsd %xmm7,%xmm9
400c46: mulsd %xmm2,%xmm7
400c4a: movsd %xmm0,0x8(%rdi)
400c4f: movapd %xmm1,%xmm0
400c53: mulsd %xmm8,%xmm0
400c58: mulsd %xmm3,%xmm8
400c5d: subsd %xmm9,%xmm0
400c62: subsd %xmm7,%xmm8
400c67: movapd %xmm4,%xmm7
400c6b: mulsd %xmm5,%xmm7
400c6f: movsd %xmm0,0x10(%rdi)
400c74: mulsd %xmm2,%xmm5
400c78: movapd %xmm1,%xmm0
400c7c: mulsd %xmm6,%xmm0
400c80: movsd %xmm8,0x18(%rdi)
400c86: mulsd %xmm3,%xmm6
400c8a: mulsd %xmm2,%xmm1
400c8e: mulsd %xmm4,%xmm3
400c92: subsd %xmm7,%xmm0
400c96: subsd %xmm5,%xmm6
400c9a: subsd %xmm1,%xmm3
400c9e: movsd %xmm0,0x20(%rdi)
400ca3: movsd %xmm6,0x28(%rdi)
400ca8: movsd %xmm3,0x30(%rdi)
400cad: retq
400cae: nop
400caf: nop
This looks pretty good to me - the components of the first (%rsi) and second (%rdx) vectors are loaded only once and the actual computations are done entirely in registers. At the end, the result is written to the address in register %rdi. Since this is the first argument register, I think return value optimization is employed here.
Compare this with following listing for the expression template version of the function above:
400cb0: mov (%rsi),%rdx
400cb3: mov 0x8(%rsi),%rax
400cb7: movsd 0x1f1(%rip),%xmm4 # 400eb0 <_IO_stdin_used+0x10>
400cbf: movsd 0x10(%rdx),%xmm3
400cc4: movsd 0x18(%rdx),%xmm0
400cc9: mulsd 0x10(%rax),%xmm3
400cce: xorpd %xmm4,%xmm0
400cd2: mulsd 0x18(%rax),%xmm0
400cd7: movsd 0x8(%rdx),%xmm2
400cdc: movsd (%rdx),%xmm1
400ce0: mulsd 0x8(%rax),%xmm2
400ce5: mulsd (%rax),%xmm1
400ce9: subsd %xmm3,%xmm0
400ced: subsd %xmm2,%xmm0
400cf1: addsd %xmm0,%xmm1
400cf5: movsd %xmm1,(%rdi)
400cf9: movsd (%rdx),%xmm0
400cfd: movsd 0x8(%rdx),%xmm1
400d02: mulsd 0x8(%rax),%xmm0
400d07: mulsd (%rax),%xmm1
400d0b: subsd %xmm1,%xmm0
400d0f: movsd %xmm0,0x8(%rdi)
400d14: movsd (%rdx),%xmm0
400d18: movsd 0x10(%rdx),%xmm1
400d1d: mulsd 0x10(%rax),%xmm0
400d22: mulsd (%rax),%xmm1
400d26: subsd %xmm1,%xmm0
400d2a: movsd %xmm0,0x10(%rdi)
400d2f: movsd 0x8(%rdx),%xmm0
400d34: movsd 0x10(%rdx),%xmm1
400d39: mulsd 0x10(%rax),%xmm0
400d3e: mulsd 0x8(%rax),%xmm1
400d43: subsd %xmm1,%xmm0
400d47: movsd %xmm0,0x18(%rdi)
400d4c: movsd (%rdx),%xmm0
400d50: movsd 0x18(%rdx),%xmm1
400d55: mulsd 0x18(%rax),%xmm0
400d5a: mulsd (%rax),%xmm1
400d5e: subsd %xmm1,%xmm0
400d62: movsd %xmm0,0x20(%rdi)
400d67: movsd 0x8(%rdx),%xmm0
400d6c: movsd 0x18(%rdx),%xmm1
400d71: mulsd 0x18(%rax),%xmm0
400d76: mulsd 0x8(%rax),%xmm1
400d7b: subsd %xmm1,%xmm0
400d7f: movsd %xmm0,0x28(%rdi)
400d84: movsd 0x10(%rdx),%xmm0
400d89: movsd 0x18(%rdx),%xmm1
400d8e: mulsd 0x18(%rax),%xmm0
400d93: mulsd 0x10(%rax),%xmm1
400d98: subsd %xmm1,%xmm0
400d9c: movsd %xmm0,0x30(%rdi)
400da1: retq
The signature of this function is
spinor<product<vector, vector>>(product<vector, vector> const&)
I hope you trust me that both versions give the same result. The first two lines extract the first and second vector, which are stored as references in product.
I wondered about the following things:

1. What do movsd 0x1f1(%rip),%xmm4 in combination with xorpd %xmm4,%xmm0 do? I already found out that this is called "RIP-relative addressing", see http://www.x86-64.org/documentation/assembly.html
2. Why doesn't the compiler cache the value at 0x10(%rax), which is read four times?

I also benchmarked both functions by generating 100000000 random vectors and taking the time both functions needed:
ET: 7.5 sec
HW: 6.8 sec
The hand-written function is about 10% faster. Does anyone have experience with expression templates and knows how to make them perform closer to their hand-written counterpart?
It would be clearer if we knew for sure the contents of address 0x400eb0, but I suspect it is 0x8000 0000 0000 0000 8000 0000 0000 0000 or similar (possibly with the upper half zero, because the code is not vectorized), written as a 128-bit constant. In that case the xorpd flips the sign of its second operand, since XOR-ing with that pattern toggles exactly the sign bit of an IEEE-754 double.
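The sign-flip trick can be reproduced in scalar C++ code; this is only a demonstration of the bit manipulation, not anything from the questioner's library:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// XOR-ing the bit pattern of an IEEE-754 double with the sign-bit mask
// 0x8000000000000000 negates it; xorpd does exactly this per 64-bit lane
// when given such a constant.
double flip_sign(double x)
{
    std::uint64_t bits;
    std::memcpy(&bits, &x, sizeof bits);   // reinterpret the double as raw bits
    bits ^= 0x8000000000000000ull;         // toggle only the sign bit
    double result;
    std::memcpy(&result, &bits, sizeof result);
    return result;
}
```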
As to why the memory read isn't cached - better to ask this on the gcc-help mailing list. Possibly the compiler can't prove that the two vectors and the intermediate result do not alias.
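This kind of aliasing pessimization can sometimes be relieved with GCC's __restrict extension (the function below is an illustrative toy, not code from the question):

```cpp
#include <cassert>

// Without __restrict, a store through `out` could overwrite a[0], so the
// compiler would have to reload a[0] for the second statement. With the
// __restrict promise (a GCC/Clang extension, spelled `restrict` in C99),
// it may keep a[0] in a register across the store.
void scale_pair(double* __restrict out, double const* __restrict a)
{
    out[0] = a[0] * a[0];   // a[0] loaded once ...
    out[1] = a[0] * a[1];   // ... and potentially reused here
}
```

Whether this actually removes the redundant loads in the expression-template version depends on how the references inside product are held, so it is worth re-checking the disassembly after any such change.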
But contrary to popular opinion, compilers do not always optimize perfectly - only better than 90% (or 99%?) of all programmers who try to write assembly - and sometimes (rarely) they produce really slow code.
But your approach is very good - benchmarking and looking at the generated object code is the right thing to do if you want to optimize.
PS: The code could possibly be accelerated by using vector instructions (mulpd instead of mulsd), which multiply two or four doubles in one go, aka SSE or AVX. But some instructions are needed to shuffle the values to the right places in the registers, so the gain is always less than a factor of two or four.