As part of a self-education project I looked into how g++ handles the std::complex type and was puzzled by this simple function:
#include <complex>
std::complex<double> c;
void get(std::complex<double> &res){
res=c;
}
Compiled with g++-6.3 -O3 (or also -Os) for Linux64, I got this result:
movsd c(%rip), %xmm0
movsd %xmm0, (%rdi)
movsd c+8(%rip), %xmm0
movsd %xmm0, 8(%rdi)
ret
So it moves the real and imaginary parts individually as 64-bit floats. However, I would expect the assembly to use two movups instead of four movsd, i.e. moving the real and imaginary parts simultaneously as one 128-bit package:
movups c(%rip), %xmm0
movups %xmm0, (%rdi)
ret
This is not only twice as fast on my machine (Intel Broadwell) as the movsd version, but also needs only 16 bytes while the movsd version needs 36 bytes.
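For what it's worth, a single 128-bit move should be legal here because the C++ standard guarantees that a std::complex<double> can be accessed like a double[2] with the real part first. Below is a minimal sketch that checks this layout; the helper names layout_matches and check_example are made up for illustration and are not part of any library.

```cpp
#include <cassert>
#include <complex>

// Two doubles and nothing else: on x86-64 that is one 16-byte package.
static_assert(sizeof(std::complex<double>) == 2 * sizeof(double),
              "std::complex<double> holds exactly two doubles");

// Hypothetical helper: returns true if the real part is stored first and
// the imaginary part directly after it (array-oriented access, C++11).
bool layout_matches(const std::complex<double> &z) {
    const double *parts = reinterpret_cast<const double *>(&z);
    return parts[0] == z.real() && parts[1] == z.imag();
}

// Hypothetical usage: the assert fires only if the layout assumption fails.
void check_example() {
    std::complex<double> z(1.5, -2.5);
    assert(layout_matches(z));
}
```

So both parts are adjacent in memory, which is what would allow them to be copied with one 128-bit load and one 128-bit store.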
What is the reason for g++ creating an assembly with movsd?
Is there an additional compiler flag enabling movups which I should use next to -O3?
Are there downsides of movups I'm not aware of?
More context: I try to compare two possible function signatures:
std::complex<double> get(){
return c;
}
and
void get(std::complex<double> &res){
res=c;
}
The first version has to put the real part and the imaginary part into different registers (xmm0 and xmm1) because of the SystemV ABI. With the second version, however, one could try to take advantage of the SSE operations, which work on 128 bits, but this does not work with my g++ version.
Edit: As kennytm's answer suggests, g++ seems to produce non-optimal assembly. It always uses four movsd for copying a std::complex from one memory location to another, as for example in:
void get(std::complex<double> *res){
res[1]=res[0];
}
There is now a bug report filed in the gcc Bugzilla.
Both clang and icc use only one SSE register. You can check the compiled code at https://godbolt.org/g/55lPv0.
get(std::complex<double>&):
movups c(%rip), %xmm0
movups %xmm0, (%rdi)
ret
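If the 128-bit copy is needed from a compiler that does not produce it on its own, one possible workaround is to spell it out with SSE2 intrinsics. The following is only a sketch under a few assumptions: it targets x86/x86-64 only, the function name get_sse is invented here, and the exact instructions emitted are still up to the compiler (an unaligned load/store pair such as movupd or movups is typical, but not guaranteed).

```cpp
#include <complex>
#include <emmintrin.h>  // SSE2 intrinsics: _mm_loadu_pd, _mm_storeu_pd

std::complex<double> c;

// Hypothetical variant of get(): copies both parts of c with one unaligned
// 128-bit load and one unaligned 128-bit store. This relies on
// std::complex<double> being laid out like double[2] (real part first).
void get_sse(std::complex<double> &res) {
    __m128d both = _mm_loadu_pd(reinterpret_cast<const double *>(&c));
    _mm_storeu_pd(reinterpret_cast<double *>(&res), both);
}
```

The unaligned variants are used because nothing guarantees that the destination passed in by the caller is 16-byte aligned. The price is portability: the plain res=c; version works on every target, this one does not.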