 

Best way to load/store between general purpose registers and xmm/ymm registers

What is the best way to load and store general-purpose registers to/from SIMD registers? So far I have been using the stack as a temporary. For example,

mov [rsp + 0x00], r8
mov [rsp + 0x08], r9
mov [rsp + 0x10], r10
mov [rsp + 0x18], r11
vmovdqa ymm0, [rsp] ; stack is properly aligned first.

I don't think there's any instruction that can do this directly (or in the other direction), since it would mean an instruction with five operands. However, the code above seems silly to me. Is there a better way to do it? The only alternative I can think of is pinsrd and related instructions, but that doesn't seem any better.

The motivation is that sometimes it is faster to do some things with AVX2 and others with general-purpose registers. For example, within a small piece of code there are four 64-bit unsigned integers, and I need four xors and two mulxs from BMI2. It would be faster to do the xors with vpxor, but mulx has no AVX2 equivalent, so any performance gain of vpxor over four xors is lost to the packing and unpacking.

asked Nov 16 '16 by Yan Zhou




1 Answer

Is your bottleneck latency, throughput, or fused-domain uops? If it's latency, then store/reload is horrible, because of the store-forwarding stall from narrow stores to a wide load.

For throughput and fused-domain uops, it's not horrible: Just 5 fused-domain uops, bottlenecking on the store port. If the surrounding code is mostly ALU uops, it's worth considering.
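For the direction the question also asks about (vector to GPRs), a store/reload is a more reasonable choice, since each narrow reload is fully contained in the wide store and can normally be store-forwarded. A minimal sketch (my addition, not part of the original answer), assuming rsp is 32-byte aligned as in the question:

vmovdqa [rsp], ymm0        # one 32-byte aligned store
mov     r8,  [rsp + 0x00]  # 8-byte reloads fully contained in the store
mov     r9,  [rsp + 0x08]  # can usually be store-forwarded, unlike the
mov     r10, [rsp + 0x10]  # narrow-store -> wide-load case above
mov     r11, [rsp + 0x18]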


For the use-case you propose:

Spending a lot of instructions/uops on moving data between integer and vector regs is usually a bad idea. PMULUDQ does give you the equivalent of a 32-bit mulx, but you're right that 64-bit multiplies aren't available directly in AVX2. (AVX512 has them).

You can do a 64-bit vector multiply using the usual extended-precision techniques with PMULUDQ. My answer on Fastest way to multiply an array of int64_t? found that vectorizing 64 x 64 => 64b multiplies was worth it with AVX2 256b vectors, but not with 128b vectors. But that was with data in memory, not with data starting and ending in vector regs.
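As a rough illustration of that technique (my sketch, not code from the linked answer, assuming the four 64-bit operands already sit in ymm0 and ymm1), a low-half 64x64 => 64-bit vector multiply with AVX2 can be built like this:

vpsrlq   ymm2, ymm0, 32     # a_hi
vpsrlq   ymm3, ymm1, 32     # b_hi
vpmuludq ymm4, ymm0, ymm3   # a_lo * b_hi
vpmuludq ymm5, ymm2, ymm1   # a_hi * b_lo
vpaddq   ymm4, ymm4, ymm5   # sum of cross products
vpsllq   ymm4, ymm4, 32     # shift cross products into the high 32 bits
vpmuludq ymm0, ymm0, ymm1   # a_lo * b_lo (full 64-bit product)
vpaddq   ymm0, ymm0, ymm4   # low 64 bits of each a*b

That's 8 vector instructions for 4 multiplies, which is roughly why it only starts to pay off at 256-bit vector width.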

In this case, it might be worth building a 64x64 => 128b full multiply out of multiple 32x32 => 64-bit vector multiplies, but it might take so many instructions that it's just not worth it. If you do need the upper-half results, unpacking to scalar (or doing your whole thing scalar) might be best.

Integer XOR is extremely cheap, with excellent ILP (latency=1, throughput = 4 per clock). It's definitely not worth moving your data into vector regs just to XOR it, if you don't have anything else vector-friendly to do there. See the x86 tag wiki for performance links.


Probably the best way for latency is:

vmovq   xmm0, r8
vmovq   xmm1, r10                  # 1 uop for p5 (SKL), 1c latency
vpinsrq xmm0, xmm0, r9, 1          # 2 uops for p5 (SKL), 3c latency
vpinsrq xmm1, xmm1, r11, 1
vinserti128 ymm0, ymm0, xmm1, 1    # 1 uop for p5 (SKL), 3c latency

Total: 7 uops for p5, with enough ILP to run them almost all back-to-back. Since presumably r8 will be ready a cycle or two sooner than r10 anyway, you're not losing much.
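For the other direction (unpacking a ymm back into four GPRs), the analogous ALU-only route would look something like this sketch (my addition, not part of the original answer):

vextracti128 xmm1, ymm0, 1   # high 128 bits
vmovq   r8,  xmm0            # element 0
vpextrq r9,  xmm0, 1         # element 1
vmovq   r10, xmm1            # element 2
vpextrq r11, xmm1, 1         # element 3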


Also worth considering: whatever you were doing to produce r8..r11, do it with vector-integer instructions so your data is already in XMM regs. Then you still need to shuffle them together, though, with 2x PUNPCKLQDQ and VINSERTI128.
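For instance (a sketch assuming the four qwords land in the low halves of xmm0..xmm3; the register choice is just for illustration):

vpunpcklqdq xmm0, xmm0, xmm1      # q0, q1
vpunpcklqdq xmm2, xmm2, xmm3      # q2, q3
vinserti128 ymm0, ymm0, xmm2, 1   # q0, q1, q2, q3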

answered Nov 09 '22 by Peter Cordes