
What's the difference between vextracti128 and vextractf128?

Tags: x86, avx, simd, avx2

vextracti128 and vextractf128 have the same functionality, parameters, and return values, yet one belongs to the AVX instruction set while the other belongs to AVX2. What is the difference?

asked Sep 25 '13 by user2813757


People also ask

What does AVX stand for CPU?

Advanced Vector Extensions (AVX) are extensions to the x86 instruction set architecture for microprocessors from Intel and Advanced Micro Devices (AMD).

What is AVX used for?

Intel AVX is designed for use by applications that are strongly floating point compute intensive and can be vectorized. Example applications include audio processing and audio codecs, image and video editing applications, financial services analysis and modeling software, and manufacturing and engineering software.

What is the difference between AVX and AVX2?

The only difference between AVX and AVX2 for floating-point code is the availability of the new FMA instructions – both AVX and AVX2 have 256-bit FP registers. The main advantage of the new AVX2 ISA is for integer code/data types – there you can expect up to a 2x speedup, though the ~8% gain for FP code is still a good speedup of AVX2 over AVX.


1 Answer

vextracti128 and vextractf128 share not only the same functionality, parameters, and return values; they also have the same instruction length, and the same throughput (according to Agner Fog's optimization manuals).
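For reference, the two are used identically – both extract the 128-bit lane selected by an immediate into an xmm register (a minimal sketch):

    vextractf128 xmm0, ymm1, 1   ; AVX:  extract upper 128-bit lane (FP domain)
    vextracti128 xmm0, ymm1, 1   ; AVX2: extract upper 128-bit lane (integer domain)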

What is not completely clear is their latency (performance in tight loops with dependency chains). The latency of the instructions themselves is 3 cycles. But after reading section 2.1.3 ("Execution Engine") of the Intel Optimization Manual we might suspect that vextracti128 should incur an additional 1-clock delay when working with floating-point data, and vextractf128 an additional 1-clock delay when working with integer data. Measurements show that this is not true and the latency always remains exactly 3 cycles (at least on Haswell processors). As far as I know, this is not documented anywhere in the Optimization Manual.
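One way such a latency can be measured is with a loop-carried dependency chain. A hypothetical microbenchmark sketch (the labels and counter are arbitrary; the measured time covers the extract+insert round trip, so the insert's known latency must be subtracted out):

    _latloop:
        vextractf128 xmm1, ymm0, 1          ; latency under test
        vinsertf128  ymm0, ymm0, xmm1, 1    ; feed the result back into the chain
        dec ecx
        jnz _latloop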

Still, an instruction set is only an interface to the processor. Haswell is (for now) the only implementation of this interface containing both of these instructions. We could ignore the fact that their implementations are (most likely) identical and use the instructions as intended: vextracti128 for integer data and vextractf128 for FP data. (If we only need to reorder data without performing any int/FP operations, the obvious choice is vextractf128, as it is supported by several older processors.) Experience also shows that Intel sometimes decreases the performance of some instructions in subsequent CPU generations, so it would be wise to observe these instructions' domain affinity to avoid any possible speed degradation in the future.

Since the Intel Optimization Manual is not very detailed in describing the relationship between the int/FP domains for SIMD instructions, I made some more measurements (on Haswell) and got some interesting results:


Shuffle instructions

There is no additional delay for any transition between SSE integer instructions and shuffle instructions, and none for any transition between SSE FP instructions and shuffle instructions (though I didn't test every instruction). For example, you can insert an "obviously integer" instruction such as pshufb between two FP instructions with no extra delay. Inserting shufpd in the middle of integer code likewise adds no extra delay.

Since vextracti128 and vextractf128 are executed by the shuffle unit, they also have this "no delay" property.

This may be useful for optimizing mixed int+FP code. If you need to reinterpret FP data as integers and at the same time shuffle the register, just make sure all FP instructions come before the shuffle and all integer instructions after it.
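A sketch of this ordering (register numbers are arbitrary) – with the shuffle placed at the boundary, neither side pays a bypass delay:

    vmulps       ymm0, ymm0, ymm1    ; FP work first
    vextractf128 xmm2, ymm0, 1       ; shuffle unit: free transition in both directions
    vpaddd       xmm2, xmm2, xmm3    ; integer work after the shuffle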


FP logical instructions

andps and other FP logical instructions also have the property of ignoring FP/int domains.

If you add an integer logical instruction (like pand) to FP code, you get an additional 2-cycle delay (one cycle to get into the int domain and another to get back to FP). So the obvious choice for SIMD FP code is andps. The same andps may be used in the middle of integer code without any delay; it is even better to use such instructions right at the boundary between int and FP instructions. Interestingly, the FP logical instructions use the same port (port 5) as all the shuffle instructions.
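To illustrate (a sketch; register numbers are arbitrary): the two sequences below compute the same bitwise result, but on Haswell the pand version pays the 2-cycle domain round trip:

    vmulps ymm0, ymm0, ymm1
    vandps ymm0, ymm0, ymm2    ; FP logical: no bypass delay
    vmulps ymm0, ymm0, ymm3

    vmulps ymm0, ymm0, ymm1
    vpand  ymm0, ymm0, ymm2    ; int domain: +1 cycle in, +1 cycle out
    vmulps ymm0, ymm0, ymm3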


Register access

The Intel Optimization Manual describes bypass delays between producer and consumer micro-ops, but it does not say anything about how micro-ops interact with registers.

This piece of code needs only 3 clocks per iteration (just as required by vaddps):

    vxorps ymm7, ymm7, ymm7
_benchloop:
    vaddps ymm0, ymm0, ymm7
    jmp _benchloop

But this one needs 2 clocks per iteration (1 more than needed for vpaddd):

    vpxor ymm7, ymm7, ymm7
_benchloop:
    vpaddd ymm0, ymm0, ymm7
    jmp _benchloop

The only difference here is that the calculations are in the integer domain instead of the FP domain. To get to 1 clock/iteration we need to add an instruction:

    vpxor ymm7, ymm7, ymm7
_benchloop:
    vpand ymm6, ymm7, ymm7
    vpaddd ymm0, ymm0, ymm6
    jmp _benchloop

This hints that (1) all values stored in SIMD registers belong to the FP domain, and (2) reading from a SIMD register increases an integer operation's latency by one cycle. (The difference between {ymm0, ymm6} and ymm7 here is that ymm7 is stored in some scratch memory and acts as a real "register", while ymm0 and ymm6 are transient and represented by the state of the CPU's internal interconnections rather than permanent storage; ymm0 and ymm6 are not "read" but simply forwarded between micro-ops.)

answered Nov 01 '22 by Evgeny Kluev