
What's the difference between vextracti128 and vextractf128?

Tags: x86, avx, simd, avx2

vextracti128 and vextractf128 have the same functionality, parameters, and return values, yet one belongs to the AVX instruction set while the other belongs to AVX2. What is the difference?

asked Sep 25 '13 by user2813757


People also ask

What does AVX stand for CPU?

Advanced Vector Extensions (AVX) are extensions to the x86 instruction set architecture for microprocessors from Intel and Advanced Micro Devices (AMD).

What is AVX used for?

Intel AVX is designed for use by applications that are strongly floating point compute intensive and can be vectorized. Example applications include audio processing and audio codecs, image and video editing applications, financial services analysis and modeling software, and manufacturing and engineering software.

What is the difference between AVX and AVX2?

The only difference between AVX and AVX2 for floating-point code is the availability of the new FMA instructions – both AVX and AVX2 have 256-bit FP registers. The main advantage of the new AVX2 ISA is for integer code/data types – there you can expect up to a 2x speedup, though the ~8% gain for FP code is still a good speedup of AVX2 over AVX.


1 Answer

vextracti128 and vextractf128 share not only the same functionality, parameters, and return values; they also have the same instruction length, and the same throughput (according to Agner Fog's optimization manuals).
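For reference, the two are used identically – both extract the 128-bit lane selected by an immediate into an xmm register (a minimal sketch):

    vextractf128 xmm0, ymm1, 1   ; AVX:  extract upper 128-bit lane (FP domain)
    vextracti128 xmm0, ymm1, 1   ; AVX2: extract upper 128-bit lane (integer domain)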

What is not completely clear is their latency (performance in tight loops with dependency chains). The latency of the instructions themselves is 3 cycles. But after reading section 2.1.3 ("Execution Engine") of the Intel Optimization Manual we might suspect that vextracti128 should incur an additional 1-clock delay when working with floating-point data, and vextractf128 an additional 1-clock delay when working with integer data. Measurements show that this is not true and the latency always remains exactly 3 cycles (at least on Haswell processors). As far as I know, this is not documented anywhere in the Optimization Manual.
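One way such a latency can be measured is with a loop-carried dependency chain. A hypothetical microbenchmark sketch (the labels and counter are arbitrary; the measured time covers the extract+insert round trip, so the insert's known latency must be subtracted out):

    _latloop:
        vextractf128 xmm1, ymm0, 1          ; latency under test
        vinsertf128  ymm0, ymm0, xmm1, 1    ; feed the result back into the chain
        dec ecx
        jnz _latloop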

Still, an instruction set is only an interface to the processor. Haswell is (for now) the only implementation of this interface containing both of these instructions. We could ignore the fact that their implementations are (most likely) identical and use the instructions as intended: vextracti128 for integer data and vextractf128 for FP data. (If we only need to reorder data without performing any int/FP operations, the obvious choice is vextractf128, as it is supported by several older processors.) Experience also shows that Intel sometimes decreases the performance of some instructions in subsequent CPU generations, so it would be wise to observe these instructions' domain affinity to avoid any possible speed degradation in the future.

Since the Intel Optimization Manual is not very detailed in describing the relationship between the int/FP domains for SIMD instructions, I made some more measurements (on Haswell) and got some interesting results:


Shuffle instructions

There is no additional delay for any transition between SSE integer instructions and shuffle instructions, and none for any transition between SSE FP instructions and shuffle instructions (though I didn't test every instruction). For example, you can insert an "obviously integer" instruction such as pshufb between two FP instructions with no extra delay. Inserting shufpd in the middle of integer code likewise adds no extra delay.

Since vextracti128 and vextractf128 are executed by the shuffle unit, they also have this "no delay" property.

This may be useful for optimizing mixed int+FP code. If you need to reinterpret FP data as integers and at the same time shuffle the register, just make sure all FP instructions come before the shuffle and all integer instructions after it.
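A sketch of this ordering (register numbers are arbitrary) – with the shuffle placed at the boundary, neither side pays a bypass delay:

    vmulps       ymm0, ymm0, ymm1    ; FP work first
    vextractf128 xmm2, ymm0, 1       ; shuffle unit: free transition in both directions
    vpaddd       xmm2, xmm2, xmm3    ; integer work after the shuffle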


FP logical instructions

andps and other FP logical instructions also have the property of ignoring FP/int domains.

If you add an integer logical instruction (like pand) to FP code, you get an additional 2-cycle delay (one cycle to get into the int domain and another to get back to FP). So the obvious choice for SIMD FP code is andps. The same andps may be used in the middle of integer code without any delay; it is even better to use such instructions right at the boundary between int and FP instructions. Interestingly, the FP logical instructions use the same port (port 5) as all the shuffle instructions.
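To illustrate (a sketch; register numbers are arbitrary): the two sequences below compute the same bitwise result, but on Haswell the pand version pays the 2-cycle domain round trip:

    vmulps ymm0, ymm0, ymm1
    vandps ymm0, ymm0, ymm2    ; FP logical: no bypass delay
    vmulps ymm0, ymm0, ymm3

    vmulps ymm0, ymm0, ymm1
    vpand  ymm0, ymm0, ymm2    ; int domain: +1 cycle in, +1 cycle out
    vmulps ymm0, ymm0, ymm3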


Register access

The Intel Optimization Manual describes bypass delays between producer and consumer micro-ops, but it does not say anything about how micro-ops interact with registers.

This piece of code needs only 3 clocks per iteration (just as required by vaddps):

    vxorps ymm7, ymm7, ymm7
_benchloop:
    vaddps ymm0, ymm0, ymm7
    jmp _benchloop

But this one needs 2 clocks per iteration (1 more than needed for vpaddd):

    vpxor ymm7, ymm7, ymm7
_benchloop:
    vpaddd ymm0, ymm0, ymm7
    jmp _benchloop

The only difference here is that the calculations are in the integer domain instead of the FP domain. To get to 1 clock/iteration we need to add an instruction:

    vpxor ymm7, ymm7, ymm7
_benchloop:
    vpand ymm6, ymm7, ymm7
    vpaddd ymm0, ymm0, ymm6
    jmp _benchloop

This hints that (1) all values stored in SIMD registers belong to the FP domain, and (2) reading from a SIMD register increases an integer operation's latency by one cycle. (The difference between {ymm0, ymm6} and ymm7 here is that ymm7 is stored in some scratch memory and acts as a real "register", while ymm0 and ymm6 are transient and represented by the state of the CPU's internal interconnections rather than permanent storage; ymm0 and ymm6 are not "read" but simply forwarded between micro-ops.)

answered Nov 01 '22 by Evgeny Kluev