How to split a 128-bit xmm register into two 64-bit quadwords?
I have a very large number in xmm1 and want to get the high quadword into r9 and the low quadword into r10 (or into RAX and RDX).

movlpd and movhpd only work between an XMM register and memory, not between two registers.
There are eight XMM registers available in non-64-bit modes and 16 XMM registers in long mode, which allow simultaneous operations on 16 bytes.
XMM registers are a completely separate register set, introduced with SSE and still widely used to this day. They are 128 bits wide, with instructions that can treat them as arrays of 64-bit or 32-bit (integer and floating-point), or 16-bit or 8-bit (integer only) values.
In the System V x86-64 ABI, xmm0 is used to return values from functions and as the first floating-point argument; xmm0 through xmm7 pass floating-point arguments. They're *all* scratch registers; a perfectly anarchist setup. Save things to memory before calling any functions, because everything can get trashed! SSE also brought a brand new set of instructions, like movss and addss.
SSE2 (baseline for x86-64) has instructions for moving data directly between XMM and integer registers (without bouncing through memory). It's easy for the low element of a vector: MOVD or MOVQ. To extract higher elements, you can just shuffle the element you want down to the low element of a vector.
SSE4.1 also added insert/extract for sizes other than 16-bit (e.g. PEXTRQ). Other than code-size, it's not actually faster than a separate shuffle and movq on any existing CPUs, but it means you don't need any extra tmp registers.
# SSE4.1
movq rax, xmm0 # low qword
pextrq rdx, xmm0, 1 # high qword
# 128b result in rdx:rax, ready for use with div r64 for example.
# (But watch out for #DE on overflow)
# also ready for returning as a __int128_t in the SystemV x86-64 ABI
# SSE2
movq r10, xmm0
punpckhqdq xmm0, xmm0 # broadcast the high half of xmm0 to both halves
movq r9, xmm0
PUNPCKHQDQ is the most efficient way to do this. It's fast even on old CPUs with slow shuffles for element-sizes smaller than 64-bit, like 65nm Core2 (Merom/Conroe). See my horizontal sum answer for more details about that. PUNPCKHQDQ doesn't have an immediate operand, and is only SSE2, so it's only 4 bytes of code-size.
To preserve the original value of xmm0, use pshufd with a different destination. pshufd can also swap the high and low halves in place, or produce whatever other arrangement you need.
movlpd or movhpd ...
There's no point in ever using them. Use movlps / movhps instead, because they're shorter and no CPUs care about float vs. double.
You can use movhlps xmm1, xmm0 to extract the high half of xmm0 into another register, but mixing FP shuffles with integer-vector operations will cause bypass delays on some CPUs (specifically Intel Nehalem). Also beware of the dependency on the old value of xmm1 causing a latency bottleneck.
Definitely prefer pshufd for this in general. But you could use movhlps if you're tuning for a specific CPU like Core2, where movhlps is fast and runs in the integer domain, and pshufd is slow.