In x86-64 SIMD instruction names, as well as the intrinsic functions you can use to access them from C/C++, you find both the terms shuffle (e.g., <code>_mm_shuffle_epi32</code>) and permute (e.g., <code>_mm_permute_pd</code>). Superficially, they seem to both be used for data movement stuff. What's the difference?

I haven't looked for inspiration outside of x86. I don't think there's any standard convention here. <hr> I think they just switched from "shuffle" to "permute" at some point in time between SSSE3 <code>pshufb</code> and AVX1 <code>vpermilps/pd</code> / <code>vperm2f128</code>. Everything before AVX is called "shuffle", everything after is called "permute". (SSE4.x didn't introduce any instructions named "shuffle" or "permute", just <code>pinsrd</code> / <code>pextrd</code> and other operand sizes being the main shuffles that SSE4.1 added) <hr> There are 2 exceptions to that, not counting VEX / EVEX encodings of <code>vshufps</code>, <code>vpshufd</code>, etc.: AVX512F <code>VSHUFF32X4</code> (and 64x2 and integer versions) 128-bit granularity lane shuffles with immediate control have the same design as <code>vshufps</code>: low half of the destination selects elements from the first source, high half selects from the 2nd source. e.g. <code>_mm512_shuffle_i64x2(__m512i a, __m512i b, int imm);</code> This naming is helpful to remember how the shuffle control works. With 4 output lanes, there's only room for 4x 2-bit selectors, not 4x 3-bit. The 256-bit operand-size version still has the same limitation so it only uses the low 2 bits of the immediate, like <code>shufpd</code>. AVX512BITALG <code>VPSHUFBITQMB</code> is like <code>vpmultishiftqb</code> (parallel bitfield extract) + vector->mask (like a movemask). So it can select any 8 bits within each qword chunk of the input. AVX512 256-bit granularity operations currently only exist with names like VEXTRACTF32x8 and VINSERTF32x8, not shuf or perm. <hr> Intrinsic name do match instruction mnemonics as far as shuffle vs. permute, but otherwise can leave out "in lane" when the mnemonic has it, requiring the lane-crossing version to be different too. (e.g. AVX1 <code>vpermilps</code> = <code>_mm_permute_ps</code> imm8 or <code>_mm_permutevar_ps</code> __m128i control vs. AVX2 <code>vpermps</code> = <code>_mm256_permutexvar_ps</code>; not available with immediate control, but vpermpd is. Intel's intrinsics guide only lists <code>_mm256_permutevar8x32_ps</code> for <code>vpermps</code>, while the ISA ref manual only lists <code>permutexvar</code>. I assume most compilers support the older <code>permutexvar</code> name. Anyway, weird choice, 8x32 sounds like an AVX512 instruction (with per-element masking); and maybe that's where that new intrinsic name came from. <hr> There are no other pattern I've noticed. We can easily rule out all of the following hypotheses: <ul> <li>copy-and-shuffle (<code>pshufd xmm, xmm/mem, imm</code>) vs. in-place shuffle (<code>pshufb data, idx</code> or <code>shufps xmm, xmm, imm</code>)</li> <li>immediate control vs. variable-control (pshufd vs. pshufb or AVX2 vpermd vs. vperm2i128)</li> <li>integer vs. FP (SSE2 pshufd vs. shufps/pd ; AVX2 vpermps vs. vpermd)</li> <li>1-source vs. 2-source (pshufd vs. shufps ; AVX2 vpermd vs. AVX512 vpermt2d)</li> <li>lane-crossing vs. in-lane (AVX1 <code>vpermilps</code> vs. AVX2 <code>vpermps</code>)</li> </ul> The shuffle-control immediate works the same way in <code>pshufd</code> and <code>vpermq</code>-immediate. But unlike the "tricky" <code>vshuff32x4</code> case, both <code>pshufd</code> and <code>vpermq</code> work the obvious way so no need to make an analogy to another mnemonic. Also, "pshuf" is a bit awkward vs. "shuf" or "perm", so I can see why they'd want something else for packed-integer. Note that the "shuf" names go all the way back to SSE1 <code>shufps</code>, introduced by Pentium III (Katmai) at the same time as MMX2 <code>pshufw mm, mm, imm8</code>. P5 Pentium MMX didn't have any instructions named shuf/perm instructions, just the <code>punpckl/h</code> shuffles in various sizes. https://nasm.us/doc/nasmdocb.html#section-B.1.7 (That NASM appendix is helpful because it sorts mnemonics into groups by order of introduction. It's what got me to notice the <code>vshuff32x4</code> mnemonic down in the AVX512 stuff after I thought they'd switched to calling everything "perm".)

What is the difference between shuffle and permute

Tags:

x86

naming

avx

intel

simd

In x86-64 SIMD instruction names, as well as the intrinsic functions you can use to access them from C/C++, you find both the terms shuffle (e.g., _mm_shuffle_epi32) and permute (e.g., _mm_permute_pd).

Superficially, they seem to both be used for data movement stuff. What's the difference?

454

asked Aug 15 '19 03:08

BeeOnRope

1 Answers

I haven't looked for inspiration outside of x86. I don't think there's any standard convention here.

I think they just switched from "shuffle" to "permute" at some point in time between SSSE3 pshufb and AVX1 vpermilps/pd / vperm2f128. Everything before AVX is called "shuffle", everything after is called "permute".

(SSE4.x didn't introduce any instructions named "shuffle" or "permute", just pinsrd / pextrd and other operand sizes being the main shuffles that SSE4.1 added)

There are 2 exceptions to that, not counting VEX / EVEX encodings of vshufps, vpshufd, etc.:

AVX512F VSHUFF32X4 (and 64x2 and integer versions) 128-bit granularity lane shuffles with immediate control have the same design as vshufps: low half of the destination selects elements from the first source, high half selects from the 2nd source. e.g. _mm512_shuffle_i64x2(__m512i a, __m512i b, int imm); This naming is helpful to remember how the shuffle control works. With 4 output lanes, there's only room for 4x 2-bit selectors, not 4x 3-bit. The 256-bit operand-size version still has the same limitation so it only uses the low 2 bits of the immediate, like shufpd.

AVX512BITALG VPSHUFBITQMB is like vpmultishiftqb (parallel bitfield extract) + vector->mask (like a movemask). So it can select any 8 bits within each qword chunk of the input.

AVX512 256-bit granularity operations currently only exist with names like VEXTRACTF32x8 and VINSERTF32x8, not shuf or perm.

Intrinsic name do match instruction mnemonics as far as shuffle vs. permute, but otherwise can leave out "in lane" when the mnemonic has it, requiring the lane-crossing version to be different too. (e.g. AVX1 vpermilps = _mm_permute_ps imm8 or _mm_permutevar_ps __m128i control vs. AVX2 vpermps = _mm256_permutexvar_ps; not available with immediate control, but vpermpd is.

Intel's intrinsics guide only lists _mm256_permutevar8x32_ps for vpermps, while the ISA ref manual only lists permutexvar. I assume most compilers support the older permutexvar name. Anyway, weird choice, 8x32 sounds like an AVX512 instruction (with per-element masking); and maybe that's where that new intrinsic name came from.

There are no other pattern I've noticed. We can easily rule out all of the following hypotheses:

copy-and-shuffle (pshufd xmm, xmm/mem, imm) vs. in-place shuffle (pshufb data, idx or shufps xmm, xmm, imm)
immediate control vs. variable-control (pshufd vs. pshufb or AVX2 vpermd vs. vperm2i128)
integer vs. FP (SSE2 pshufd vs. shufps/pd ; AVX2 vpermps vs. vpermd)
1-source vs. 2-source (pshufd vs. shufps ; AVX2 vpermd vs. AVX512 vpermt2d)
lane-crossing vs. in-lane (AVX1 vpermilps vs. AVX2 vpermps)

The shuffle-control immediate works the same way in pshufd and vpermq-immediate. But unlike the "tricky" vshuff32x4 case, both pshufd and vpermq work the obvious way so no need to make an analogy to another mnemonic. Also, "pshuf" is a bit awkward vs. "shuf" or "perm", so I can see why they'd want something else for packed-integer.

Note that the "shuf" names go all the way back to SSE1 shufps, introduced by Pentium III (Katmai) at the same time as MMX2 pshufw mm, mm, imm8.

P5 Pentium MMX didn't have any instructions named shuf/perm instructions, just the punpckl/h shuffles in various sizes.

https://nasm.us/doc/nasmdocb.html#section-B.1.7 (That NASM appendix is helpful because it sorts mnemonics into groups by order of introduction. It's what got me to notice the vshuff32x4 mnemonic down in the AVX512 stuff after I thought they'd switched to calling everything "perm".)

answered Oct 21 '22 09:10

Peter Cordes

Related questions
                            
                                masm error A2075: jump destination too far : by 30 bytes
                            
                                Can atomic instructions straddle cache lines?
                            
                                Cannot move 8 bit address to 16 bit register
                            
                                How safe is this swap w/o pushing registers?
                            
                                What does "interrupt hooking" mean?
                            
                                _mm_pause usage in gcc on Intel
                            
                                CPU Registers and Multitasking
                            
                                GCC generates useless instructions?
                            
                                How to increment a vector in AVX/AVX2
                            
                                What is the best way to clear the screen in 32-bit x86 assembly language (Video mode 13h)
                            
                                Referencing registers in machine code
                            
                                avx2 register bits reverse
                            
                                Is it allowed to access memory that spans the zero boundary in x86?
                            
                                How to move a floating-point constant value into an xmm register?
                            
                                Is it considered bad practice to use the flags register as a boolean return value?
                            
                                Are there advantages of using `setp` instead of `setb`?
                            
                                Can different CPUs on an x86 machine can have different local APIC register MMIO base addresses?
                            
                                Why does CX work in '[ ]' in 16 Bit Assembly when it is not a base or index register?
                            
                                ASM: too many memory references for `mov'
                            
                                How does the CPU know how many bytes it should read for the next instruction, considering instructions have different lenghts?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With