
What's the point of the VPERMILPS instruction (_mm_permute_ps)?

The AVX instruction set introduced VPERMILPS which seems to be a simplified version of SHUFPS (for the case where both input registers are the same).

For example, the following instruction:

c5 f0 c6 c1 00          vshufps xmm0,xmm1,xmm1,0x0

can be replaced with:

c4 e3 79 04 c1 00       vpermilps xmm0,xmm1,0x0

As you can see, the VPERMILPS version takes one byte extra and does the same thing. According to the instruction tables, both of the instructions take 1 CPU cycle and have the same throughput.
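To make the claimed equivalence concrete, here is a minimal scalar sketch in C of what the two 128-bit instructions compute. It is a model of the immediate decoding, not real SIMD code, and the function names (`model_vshufps`, `model_vpermilps_imm`, `check_same_source_equivalence`) are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of 128-bit VSHUFPS dst, a, b, imm8:
   the two low results are picked from a, the two high results from b. */
void model_vshufps(float dst[4], const float a[4], const float b[4], uint8_t imm)
{
    float r[4];
    r[0] = a[(imm >> 0) & 3];
    r[1] = a[(imm >> 2) & 3];
    r[2] = b[(imm >> 4) & 3];
    r[3] = b[(imm >> 6) & 3];
    memcpy(dst, r, sizeof r);
}

/* Scalar model of 128-bit VPERMILPS dst, src, imm8:
   all four results are picked from the single source. */
void model_vpermilps_imm(float dst[4], const float src[4], uint8_t imm)
{
    float r[4];
    for (int i = 0; i < 4; i++)
        r[i] = src[(imm >> (2 * i)) & 3];
    memcpy(dst, r, sizeof r);
}

/* Checks that both models agree for every immediate when the
   two VSHUFPS sources are the same register. */
void check_same_source_equivalence(void)
{
    const float src[4] = {10.0f, 11.0f, 12.0f, 13.0f};
    for (int imm = 0; imm < 256; imm++) {
        float x[4], y[4];
        model_vshufps(x, src, src, (uint8_t)imm);
        model_vpermilps_imm(y, src, (uint8_t)imm);
        assert(memcmp(x, y, sizeof x) == 0);
    }
}
```

Calling `check_same_source_equivalence()` from a `main` walks all 256 immediates; with an immediate of 0x0, as in the encodings above, both models broadcast element 0.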

What's the point of introducing this kind of instruction? Am I missing something?

Witek902 asked Jan 13 '19 12:01


1 Answer

Yes, using vpermilps-immediate is normally a missed optimization vs. vshufps (except on Knight's Landing): it wastes 1 byte of code size for the same operation with the same performance.


I think the main point of vpermilps is that it's available with a vector control operand. Before AVX, the only variable-control shuffle was integer pshufb.

VPERMILPS ymm1, ymm2, ymm3/m256 - Permute single-precision floating-point values in ymm2 using controls from ymm3/m256 and store result in ymm1.
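As a rough illustration of the vector-control form, here is a scalar C sketch of the 256-bit case. It assumes the documented behaviour that only the low 2 bits of each 32-bit control element are used and that selection stays within each 128-bit lane; the names `model_vpermilps_var` and `check_per_lane_control` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of VPERMILPS ymm1, ymm2, ymm3 (variable control).
   Only bits [1:0] of each 32-bit control element are used, and each
   result is picked from the same 128-bit lane as its destination. */
void model_vpermilps_var(float dst[8], const float src[8], const uint32_t ctrl[8])
{
    float r[8];
    for (int i = 0; i < 8; i++) {
        int lane_base = i & ~3;   /* 0 for the low lane, 4 for the high lane */
        r[i] = src[lane_base + (int)(ctrl[i] & 3u)];
    }
    for (int i = 0; i < 8; i++)
        dst[i] = r[i];
}

/* Reverse the low lane and broadcast element 0 of the high lane:
   two different shuffles in one instruction, chosen at run time. */
void check_per_lane_control(void)
{
    const float src[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    const uint32_t ctrl[8] = {3, 2, 1, 0, 0, 0, 0, 0};
    const float expect[8] = {3, 2, 1, 0, 4, 4, 4, 4};
    float dst[8];
    model_vpermilps_var(dst, src, ctrl);
    for (int i = 0; i < 8; i++)
        assert(dst[i] == expect[i]);
}
```

The point of the sketch is that the control comes from data, so the shuffle can differ per lane and per call, which an immediate cannot do.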


But of course the immediate form has a totally separate opcode, and you're asking why it exists. Intel definitely could have included only the vector version, so the question becomes "why did they include the immediate version?" It takes at least a bit of extra decode hardware. The shuffle unit already has hardware to unpack immediate control operands in this form, because it's identical to vshufps, so perhaps it was cheap-ish to implement?

The only thing you can do with immediate vpermilps that you can't with vshufps is load+shuffle in one instruction, like vpermilps ymm0, [rdi], 0b00011011 to reverse the elements in each lane of the source. But like most instructions with an immediate, it can't micro-fuse a memory operand so it's still 2 fused-domain uops for the front end. (On AMD CPUs, it actually does save front-end bandwidth.) Still, it saves code-size vs. vmovups ymm0, [rdi] / vshufps ymm0,ymm0,ymm0, 0b00011011.
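The load+shuffle example can be sketched with a scalar C model of the 256-bit immediate form. This is only a model of the element selection, with the source pointer standing in for the `[rdi]` memory operand; `model_vpermilps256_imm` and `check_reverse_in_lanes` are illustrative names, and the immediate is written as 0x1B because binary literals are not portable C.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of VPERMILPS ymm, ymm/m256, imm8: the same four 2-bit
   fields are applied independently to each 128-bit lane. */
void model_vpermilps256_imm(float dst[8], const float *src, uint8_t imm)
{
    float r[8];
    for (int i = 0; i < 8; i++) {
        int lane_base = i & ~3;                  /* 0 or 4 */
        int field = (imm >> (2 * (i & 3))) & 3;  /* selector for this position */
        r[i] = src[lane_base + field];
    }
    for (int i = 0; i < 8; i++)
        dst[i] = r[i];
}

/* imm8 = 0b00011011 (0x1B) selects elements 3,2,1,0,
   i.e. a reversal within each lane. */
void check_reverse_in_lanes(void)
{
    const float mem[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    const float expect[8] = {3, 2, 1, 0, 7, 6, 5, 4};
    float dst[8];
    model_vpermilps256_imm(dst, mem, 0x1B);
    for (int i = 0; i < 8; i++)
        assert(dst[i] == expect[i]);
}
```

Note that the reversal is per lane: the whole 8-element vector is not reversed, since nothing crosses the 128-bit boundary.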

Other than that, I don't see much point. They both do the same shuffle in both 128-bit lanes, reusing the 4x 2-bit fields of the immediate for both lanes. (vpermilpd and vshufpd, by contrast, use 1-bit fields in their immediates and can do a different shuffle in each lane: the upper lane uses bits 2 and 3, and the ZMM versions use bits 4..7 for the upper 256 bits. So again vpermilpd dst, src, imm is identical to vshufpd dst, src, src, imm, unless you use a memory source or a shuffle-control vector instead of an immediate.)
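The 1-bit-per-element layout of the PD immediates can be sketched the same way. This is a scalar C model of the 256-bit forms under the bit assignment described above (bits 0 and 1 for the low lane, bits 2 and 3 for the high lane); the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of VSHUFPD ymm, a, b, imm8: even results come from a,
   odd results from b, one selector bit per result. */
void model_vshufpd256(double dst[4], const double a[4], const double b[4], uint8_t imm)
{
    double r[4];
    r[0] = a[0 + ((imm >> 0) & 1)];
    r[1] = b[0 + ((imm >> 1) & 1)];
    r[2] = a[2 + ((imm >> 2) & 1)];
    r[3] = b[2 + ((imm >> 3) & 1)];
    memcpy(dst, r, sizeof r);
}

/* Scalar model of VPERMILPD ymm, src, imm8: bit i picks the low or
   high double of the lane that result i belongs to. */
void model_vpermilpd256(double dst[4], const double src[4], uint8_t imm)
{
    double r[4];
    for (int i = 0; i < 4; i++)
        r[i] = src[(i & ~1) + ((imm >> i) & 1)];
    memcpy(dst, r, sizeof r);
}

/* With both VSHUFPD sources equal, the two models agree for all
   16 meaningful immediates. */
void check_pd_equivalence(void)
{
    const double src[4] = {0.0, 1.0, 2.0, 3.0};
    for (int imm = 0; imm < 16; imm++) {
        double x[4], y[4];
        model_vshufpd256(x, src, src, (uint8_t)imm);
        model_vpermilpd256(y, src, (uint8_t)imm);
        assert(memcmp(x, y, sizeof x) == 0);
    }
}
```

For instance, an immediate of 0x1 swaps nothing in the high lane but picks the high double for result 0, so the two lanes end up shuffled differently, unlike the PS immediate forms.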

It makes you wonder if Intel forgot that VEX encoding was going to enable non-destructive vshufps to do the same thing for immediate shuffles.


Or maybe they had in mind their low-power CPUs, like Knight's Landing (Xeon Phi), where a 1-source shuffle is cheaper:

vpermilps has 1-cycle throughput there, but vshufps or vperm2f128 has 2-cycle throughput and an extra cycle of latency. (According to Agner Fog's instruction tables.)

So using vshufps with the same input twice is slower there.

But on Intel's big-core mainstream CPUs, yes, using vpermilps-immediate is a missed optimization vs. vshufps, unless you can use it with a memory source. vshufps would need the same memory source twice, which obviously isn't encodable.

AVX was designed years ahead of KNL, but maybe the ISA designers had in mind that some future CPU could be more efficient with a simpler shuffle.

Regular Silvermont (out-of-order Atom that KNL is based on) doesn't support AVX, but it has 1 uop / 1-cycle throughput and latency for shufps. Goldmont has 0.5c throughput for shufps.

AFAIK, when this was written Intel hadn't made a low-power core (other than Xeon Phi) with AVX. Tremont, the successor to Goldmont Plus, doesn't support it either; Gracemont (the E-cores in Alder Lake) later added AVX and AVX2.

Peter Cordes answered Nov 08 '22 11:11