
How to split an XMM 128-bit register into two 64-bit integer registers?

Tags:

x86

assembly

sse

How to split a 128-bit xmm register into two 64-bit quadwords?

I have a very large number in xmm1 and want to get the higher quadword to r9 and lower quadword to r10, or RAX and RDX.

movlpd and movhpd only work register-to-memory or vice versa.

Matthias asked Dec 19 '16



1 Answer

SSE2 (baseline for x86-64) has instructions for moving data directly between XMM and integer registers (without bouncing through memory). It's easy for the low element of a vector: MOVD or MOVQ. To extract higher elements, you can just shuffle the element you want down to the low element of a vector.

SSE4.1 also added insert/extract for sizes other than 16-bit (e.g. PEXTRQ). Other than code-size, it's not actually faster than a separate shuffle and movq on any existing CPUs, but it means you don't need any extra tmp registers.

#SSE4.1
movq    rax, xmm0       # low qword
pextrq  rdx,  xmm0, 1   # high qword
# 128b result in rdx:rax, ready for use with div r64 for example.
# (But watch out for #DE on overflow)
# also ready for returning as a __int128_t in the SystemV x86-64 ABI

#SSE2
movq       r10, xmm0
punpckhqdq xmm0, xmm0    # broadcast the high half of xmm0 to both halves
movq       r9,  xmm0

PUNPCKHQDQ is the most efficient way to do this. It's fast even on old CPUs with slow shuffles for element-sizes smaller than 64-bit, like 65nm Core2 (Merom/Conroe). See my horizontal sum answer for more details about that. PUNPCKHQDQ doesn't have an immediate operand, and is only SSE2, so it's only 4 bytes of code-size.
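The same two approaches can be written with intrinsics, which compilers will turn into exactly the instruction sequences above. A minimal C sketch (the helper names `split_u128_sse2` / `split_u128_sse41` are made up for illustration; the SSE4.1 version needs `-msse4.1`):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <smmintrin.h>  /* SSE4.1 intrinsics */
#include <stdint.h>

/* SSE2: movq for the low qword, punpckhqdq + movq for the high one. */
static void split_u128_sse2(__m128i v, uint64_t *lo, uint64_t *hi) {
    *lo = (uint64_t)_mm_cvtsi128_si64(v);       /* movq r64, xmm      */
    __m128i h = _mm_unpackhi_epi64(v, v);       /* punpckhqdq xmm,xmm */
    *hi = (uint64_t)_mm_cvtsi128_si64(h);
}

/* SSE4.1: pextrq grabs the high qword directly, no temp vector. */
static void split_u128_sse41(__m128i v, uint64_t *lo, uint64_t *hi) {
    *lo = (uint64_t)_mm_cvtsi128_si64(v);       /* movq r64, xmm      */
    *hi = (uint64_t)_mm_extract_epi64(v, 1);    /* pextrq r64, xmm, 1 */
}
```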

To preserve the original value of xmm0, use pshufd with a different destination register. A pshufd immediate can also swap the high and low halves in place, or do any other qword arrangement you need.
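For example, with intrinsics (the helper name `high_qword_pshufd` is hypothetical), `_mm_shuffle_epi32` with immediate `_MM_SHUFFLE(3,2,3,2)` = 0xEE copies the high qword into both halves while writing to a separate destination, so `v` survives:

```c
#include <emmintrin.h>
#include <stdint.h>

/* pshufd xmm1, xmm0, 0xEE: high qword broadcast, source untouched. */
static uint64_t high_qword_pshufd(__m128i v) {
    __m128i h = _mm_shuffle_epi32(v, _MM_SHUFFLE(3, 2, 3, 2));
    return (uint64_t)_mm_cvtsi128_si64(h);      /* movq r64, xmm */
}
```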


movlpd or movhpd ...

There's no point in ever using them. Use movlps / movhps instead, because they're shorter and no CPUs care about float vs. double.

You can use movhlps xmm1, xmm0 to extract the high half of xmm0 into another register, but mixing FP shuffles with integer-vector operations will cause bypass delays on some CPUs (specifically Intel Nehalem). Also beware of the dependency on xmm1 causing a latency bottleneck.

Definitely prefer pshufd for this in general. But you could use movhlps if you're tuning for a specific CPU like Core2 where movhlps is fast and runs in the integer domain, and pshufd is slow.
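If you do go the movhlps route from C, the casts are free reinterpretations of the same register; only the shuffle itself is a real instruction. A sketch (helper name `high_qword_movhlps` is made up; domain-crossing caveats from above still apply):

```c
#include <emmintrin.h>
#include <stdint.h>

/* movhlps xmm, xmm: copy the high half of the source into the
 * low half of the destination; with both operands the same, the
 * high qword ends up in the low position, ready for movq. */
static uint64_t high_qword_movhlps(__m128i v) {
    __m128 f = _mm_castsi128_ps(v);             /* no-op reinterpret */
    __m128 h = _mm_movehl_ps(f, f);             /* movhlps xmm, xmm  */
    return (uint64_t)_mm_cvtsi128_si64(_mm_castps_si128(h));
}
```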

Peter Cordes answered Sep 20 '22