As part of a self-education project I looked into how g++ handles the std::complex type and was puzzled by this simple function:
#include <complex>
std::complex<double> c;
void get(std::complex<double> &res){
res=c;
}
Compiled with g++ 6.3 and -O3 (or -Os) for 64-bit Linux, I got this result:
movsd c(%rip), %xmm0
movsd %xmm0, (%rdi)
movsd c+8(%rip), %xmm0
movsd %xmm0, 8(%rdi)
ret
So it moves the real and imaginary parts individually as 64-bit floats. However, I would expect the assembly to use two movups instead of four movsd, i.e. to move the real and imaginary parts together as a single 128-bit package:
movups c(%rip), %xmm0
movups %xmm0, (%rdi)
ret
This is not only twice as fast on my machine (Intel Broadwell) as the movsd-version, but also needs only 16 bytes while the movsd-version needs 36 bytes.
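What is the reason for g++ generating assembly with movsd?
Are there additional compiler flags that would enable the use of movups, which I should use next to -O3?
Are there downsides to using movups that I'm not aware of?
More context: I am trying to compare two possible function signatures: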
std::complex<double> get(){
return c;
}
and
void get(std::complex<double> &res){
res=c;
}
The first version has to put the real part and the imaginary part into different registers (xmm0 and xmm1) because of the System V ABI. With the second version, however, one could try to take advantage of the SSE operations that work on 128 bits. This does not happen with my g++ version.
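To make the 128-bit idea concrete, here is a minimal sketch of how the copy could be forced by hand with SSE2 intrinsics. The helper name get_packed is hypothetical and not part of the original code. It relies on std::complex<double> being accessible as an array of two doubles (real part first), which the C++ standard guarantees, and it falls back to std::memcpy when SSE2 is not available. Whether it actually beats the compiler's output would need to be measured.

```cpp
#include <complex>
#include <cstring>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 intrinsics: _mm_loadu_pd, _mm_storeu_pd
#endif

std::complex<double> c;

// Hypothetical helper (not from the original post):
// copies the real and imaginary parts of c in one 128-bit move.
void get_packed(std::complex<double> &res) {
    // std::complex<double> may be accessed as double[2]: {real, imag}.
    const double *src = reinterpret_cast<const double *>(&c);
    double *dst = reinterpret_cast<double *>(&res);
#if defined(__SSE2__)
    // Unaligned 128-bit load followed by an unaligned 128-bit store;
    // these typically map to movups/movupd.
    _mm_storeu_pd(dst, _mm_loadu_pd(src));
#else
    // Portable fallback when SSE2 is not available.
    std::memcpy(dst, src, 2 * sizeof(double));
#endif
}
```

The unaligned variants are used because nothing here guarantees 16-byte alignment of c or res.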
Edit: As kennytm's answer suggests, g++ seems to produce non-optimal assembly. It always uses four movsd instructions to copy a std::complex from one memory location to another, as for example in
void get(std::complex<double> *res){
res[1]=res[0];
}
There is now a bug report filed in the GCC Bugzilla.
Both clang and icc use only one SSE register. You can check the compiled code at https://godbolt.org/g/55lPv0.
get(std::complex<double>&):
movups c(%rip), %xmm0
movups %xmm0, (%rdi)
ret