In x86-64 SIMD instruction names, as well as the intrinsic functions you can use to access them from C/C++, you find both the terms shuffle (e.g., _mm_shuffle_epi32
) and permute (e.g., _mm_permute_pd
).
Superficially, they seem to both be used for data movement stuff. What's the difference?
shuffle. Modify a sequence in-place by shuffling its contents. This function only shuffles the array along the first axis of a multi-dimensional array.
Riffle seven times and you'll have a sufficiently random ordering of cards, an ordering that has likely never existed before. In other words, it's unlikely you'll ever shuffle two decks the same. The card shuffling result appears in a Numberphile video from 2015, along with a number of other card shuffling facts.
Random Permutations of Elements A permutation refers to an arrangement of elements. e.g. [3, 2, 1] is a permutation of [1, 2, 3] and vice-versa. The NumPy Random module provides two methods for this: shuffle() and permutation() .
sample() . random. shuffle() shuffles a list in place, and random. sample() returns a new randomized list.
I haven't looked for inspiration outside of x86. I don't think there's any standard convention here.
I think they just switched from "shuffle" to "permute" at some point in time between SSSE3 pshufb
and AVX1 vpermilps/pd
/ vperm2f128
. Everything before AVX is called "shuffle", everything after is called "permute".
(SSE4.x didn't introduce any instructions named "shuffle" or "permute", just pinsrd
/ pextrd
and other operand sizes being the main shuffles that SSE4.1 added)
There are 2 exceptions to that, not counting VEX / EVEX encodings of vshufps
, vpshufd
, etc.:
AVX512F VSHUFF32X4
(and 64x2 and integer versions) 128-bit granularity lane shuffles with immediate control have the same design as vshufps
: low half of the destination selects elements from the first source, high half selects from the 2nd source. e.g. _mm512_shuffle_i64x2(__m512i a, __m512i b, int imm);
This naming is helpful to remember how the shuffle control works. With 4 output lanes, there's only room for 4x 2-bit selectors, not 4x 3-bit. The 256-bit operand-size version still has the same limitation so it only uses the low 2 bits of the immediate, like shufpd
.
AVX512BITALG VPSHUFBITQMB
is like vpmultishiftqb
(parallel bitfield extract) + vector->mask (like a movemask). So it can select any 8 bits within each qword chunk of the input.
AVX512 256-bit granularity operations currently only exist with names like VEXTRACTF32x8 and VINSERTF32x8, not shuf or perm.
Intrinsic name do match instruction mnemonics as far as shuffle vs. permute, but otherwise can leave out "in lane" when the mnemonic has it, requiring the lane-crossing version to be different too. (e.g. AVX1 vpermilps
= _mm_permute_ps
imm8 or _mm_permutevar_ps
__m128i control vs. AVX2 vpermps
= _mm256_permutexvar_ps
; not available with immediate control, but vpermpd is.
Intel's intrinsics guide only lists _mm256_permutevar8x32_ps
for vpermps
, while the ISA ref manual only lists permutexvar
. I assume most compilers support the older permutexvar
name. Anyway, weird choice, 8x32 sounds like an AVX512 instruction (with per-element masking); and maybe that's where that new intrinsic name came from.
There are no other pattern I've noticed. We can easily rule out all of the following hypotheses:
pshufd xmm, xmm/mem, imm
) vs. in-place shuffle (pshufb data, idx
or shufps xmm, xmm, imm
)vpermilps
vs. AVX2 vpermps
)The shuffle-control immediate works the same way in pshufd
and vpermq
-immediate. But unlike the "tricky" vshuff32x4
case, both pshufd
and vpermq
work the obvious way so no need to make an analogy to another mnemonic. Also, "pshuf" is a bit awkward vs. "shuf" or "perm", so I can see why they'd want something else for packed-integer.
Note that the "shuf" names go all the way back to SSE1 shufps
, introduced by Pentium III (Katmai) at the same time as MMX2 pshufw mm, mm, imm8
.
P5 Pentium MMX didn't have any instructions named shuf/perm instructions, just the punpckl/h
shuffles in various sizes.
https://nasm.us/doc/nasmdocb.html#section-B.1.7 (That NASM appendix is helpful because it sorts mnemonics into groups by order of introduction. It's what got me to notice the vshuff32x4
mnemonic down in the AVX512 stuff after I thought they'd switched to calling everything "perm".)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With