Where is VPERMB in AVX2?

AVX2 has lots of good stuff. For example, it has plenty of instructions which are pretty much strictly more powerful than their precursors. Take VPERMD: it allows you to arbitrarily broadcast/shuffle/permute from one 256-bit vector of 32-bit values into another, with the permutation selectable at runtime¹. Functionally, that obsoletes a whole slew of older unpack, broadcast, permute, shuffle and shift instructions³.

Cool beans.

So where is VPERMB? I.e., the same instruction, but working on byte-sized elements. Or, for that matter, where is VPERMW, for 16-bit elements? Having dabbled in x86 assembly for some time, it is pretty clear that the SSE PSHUFB instruction is pretty much among the most useful instructions of all time. It can do any possible permutation, broadcast or byte-wise shuffle. Furthermore, it can also be used to do 16 parallel 4-bit -> 8-bit table lookups².
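To make the table-lookup use concrete, here is a minimal scalar sketch of what the 128-bit PSHUFB computes. It is a reference model in plain C rather than the instruction or an intrinsic, and the names `pshufb_model` and `nibbles_to_hex` are made up for illustration.

```c
#include <stdint.h>

/* Scalar reference model of 128-bit PSHUFB (not the instruction itself):
   each output byte is picked from `table` by the low 4 bits of the matching
   index byte, or zeroed when the index byte has its top bit set.
   `out` must not alias `table` or `idx`. */
static void pshufb_model(const uint8_t table[16], const uint8_t idx[16],
                         uint8_t out[16])
{
    for (int i = 0; i < 16; i++)
        out[i] = (idx[i] & 0x80) ? 0 : table[idx[i] & 0x0F];
}

/* Hypothetical helper: turn 16 nibbles (values 0..15) into ASCII hex digits.
   This is the "16 parallel 4-bit -> 8-bit lookups" use of the shuffle:
   the data being shuffled is the table, and the input is the index vector. */
static void nibbles_to_hex(const uint8_t nibbles[16], uint8_t out[16])
{
    static const uint8_t hex_digits[16] = {
        '0', '1', '2', '3', '4', '5', '6', '7',
        '8', '9', 'a', 'b', 'c', 'd', 'e', 'f'
    };
    pshufb_model(hex_digits, nibbles, out);
}
```

With real hardware the loop body is a single instruction; the model only pins down the semantics.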

Unfortunately, PSHUFB wasn't extended to be cross-lane in AVX2, so it is restricted to within-lane behavior: VPSHUFB shuffles each 128-bit lane independently. The VPERM instructions can shuffle across lanes (in fact, "perm" and "shuf" seem to be synonyms in the instruction mnemonics?), but the 8-bit and 16-bit versions were omitted.
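The lane restriction is easiest to see side by side. The following is a rough scalar sketch of the two behaviors; the function names are hypothetical and the code models the semantics only, it does not use the instructions.

```c
#include <stdint.h>

/* Model of the 256-bit AVX2 VPSHUFB: two independent 16-byte lanes.
   An index can only reach bytes inside the lane the output byte lives in.
   `out` must not alias `src` or `idx`. */
static void vpshufb256_model(const uint8_t src[32], const uint8_t idx[32],
                             uint8_t out[32])
{
    for (int i = 0; i < 32; i++) {
        int lane_base = i & 16;  /* 0 for bytes 0..15, 16 for bytes 16..31 */
        out[i] = (idx[i] & 0x80) ? 0 : src[lane_base + (idx[i] & 0x0F)];
    }
}

/* Model of AVX2 VPERMD: any of the 8 dwords can go to any output position,
   regardless of lane. Only the low 3 bits of each index are used. */
static void vpermd_model(const uint32_t src[8], const uint32_t idx[8],
                         uint32_t out[8])
{
    for (int i = 0; i < 8; i++)
        out[i] = src[idx[i] & 7];
}
```

In the first model no index value can move byte 16 of the source into byte 0 of the result, which is exactly what a hypothetical VPERMB would allow.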

There doesn't even seem to be a good way to emulate this instruction, whereas you can easily emulate the larger-width shuffles with smaller-width ones (often, it's even free: you just need a different mask).
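As a sketch of the "just a different mask" point, a dword shuffle within one 128-bit vector can be expressed as a byte shuffle by expanding each dword index into four consecutive byte indices. The code below is a scalar model with made-up names (`pshufb_ref`, `widen_dword_mask`, `shuffle_dwords_via_bytes`), not intrinsics.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of 128-bit PSHUFB; `out` must not alias the inputs. */
static void pshufb_ref(const uint8_t src[16], const uint8_t idx[16],
                       uint8_t out[16])
{
    for (int i = 0; i < 16; i++)
        out[i] = (idx[i] & 0x80) ? 0 : src[idx[i] & 0x0F];
}

/* Expand 4 dword indices (0..3) into the 16 byte indices that select the
   same data: dword d covers bytes 4*d .. 4*d+3. */
static void widen_dword_mask(const uint8_t dword_idx[4], uint8_t byte_idx[16])
{
    for (int i = 0; i < 4; i++)
        for (int k = 0; k < 4; k++)
            byte_idx[4 * i + k] = (uint8_t)(4 * (dword_idx[i] & 3) + k);
}

/* Dword shuffle implemented purely with the byte shuffle model. */
static void shuffle_dwords_via_bytes(const uint32_t src[4],
                                     const uint8_t dword_idx[4],
                                     uint32_t out[4])
{
    uint8_t src_bytes[16], byte_idx[16], out_bytes[16];
    memcpy(src_bytes, src, sizeof src_bytes);
    widen_dword_mask(dword_idx, byte_idx);
    pshufb_ref(src_bytes, byte_idx, out_bytes);
    memcpy(out, out_bytes, sizeof out_bytes);
}
```

The widened mask can be precomputed once, which is why the emulation is often free. The reverse direction does not work: no dword index can describe an arbitrary byte permutation.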

I have no doubt that Intel is aware of the wide and heavy use of PSHUFB, so the question naturally arises as to why the byte variant was omitted in AVX2. Is the operation intrinsically harder to implement in hardware? Are there encoding restrictions forcing its omission?

¹By selectable at runtime, I mean that the mask that defines the shuffling behavior comes from a register. This makes the instruction an order of magnitude more flexible than the earlier variants that take an immediate shuffle mask, in the same way that add is more flexible than inc or a variable shift is more flexible than an immediate shift.

²Or 32 such lookups in AVX2.

³The older instructions are occasionally useful if they have a shorter encoding, or avoid loading a mask from memory, but functionally they are superseded.

525

asked Jun 23 '16 00:06

BeeOnRope

1 Answer

I'm 99% sure the main factor is transistor cost of implementation. It would clearly be very useful, and the only reason it doesn't exist is that the implementation cost must outweigh the significant benefit.

Coding space issues are unlikely; the VEX coding space provides a LOT of room. Like, really a lot, since the field that represents combinations of prefixes isn't a bit-field, it's an integer with most of the values unused.

They decided to implement it for AVX512VBMI, though, with larger element sizes available in AVX512BW and AVX512F. Maybe they realized how much it sucked to not have this, and decided to do it anyway. AVX512F takes a lot of die area / transistors to implement, so much that Intel decided not to even implement it in retail desktop CPUs for a couple generations.

(Part of that is that I think these days a lot of code that can take advantage of brand new instruction sets is written to run on known servers, instead of runtime dispatching for use on client machines).

According to Wikipedia, AVX512VBMI isn't coming until Cannonlake, but then we will have vpermi2b, which does 64 parallel table lookups from a 128-byte table (2 zmm vectors). Skylake Xeon will only bring vpermi2w and larger element sizes (AVX512F + AVX512BW).
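For reference, the vpermi2b lookup can be sketched as a scalar model. This assumes the unmasked 512-bit form and ignores the fact that the real instruction overwrites its index register with the result; the function name is hypothetical.

```c
#include <stdint.h>

/* Scalar model of unmasked 512-bit vpermi2b: 64 independent lookups into a
   128-byte table formed by two 64-byte sources. Only the low 7 bits of each
   index are used: bit 6 selects the source, bits 0..5 select the byte.
   `out` must not alias `idx`, `a` or `b` in this model. */
static void vpermi2b_model(const uint8_t idx[64], const uint8_t a[64],
                           const uint8_t b[64], uint8_t out[64])
{
    for (int i = 0; i < 64; i++) {
        unsigned sel = idx[i] & 0x7F;
        out[i] = (sel & 0x40) ? b[sel & 0x3F] : a[sel & 0x3F];
    }
}
```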

I'm pretty sure that thirty two 32:1 muxers are a lot more expensive than eight 8:1 muxers, even if the 8:1 muxers are 4x wider. They could implement it with multiple stages of shuffling (rather than a single 32:1 stage), since lane-crossing shuffles get a 3-cycle time budget to get their work done. But still a lot of transistors.
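A back-of-the-envelope way to compare the two is to count 2:1 mux bits, assuming a single-stage full crossbar where each N:1 mux is a tree of N-1 two-input muxes per bit. That is a crude model chosen for illustration, not a description of how any real shuffle unit is built, and the helper name is made up.

```c
/* Crude cost model (an assumption, not real hardware data): a full crossbar
   over `elements` elements of `element_bits` bits each needs one N:1 mux per
   output bit, and an N:1 mux built as a tree takes N-1 two-input muxes. */
static unsigned long crossbar_mux2_bits(unsigned elements,
                                        unsigned element_bits)
{
    return (unsigned long)elements * element_bits * (elements - 1);
}
```

For a 256-bit vector this gives 8 * 32 * 7 = 1792 for dword elements, 16 * 16 * 15 = 3840 for words and 32 * 8 * 31 = 7936 for bytes, so under these assumptions the byte crossbar costs roughly 4.4 times as much as the dword one, before counting wiring or the extra control bits.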

I'd love to see a less hand-wavy answer from someone with hardware design experience. I built a digital timer from TTL counter chips on a breadboard once (and IIRC, read out the counter from BASIC on a TI-99/4A, which was very obsolete even ~20 years ago), but that's about it.

> It's pretty clear that the SSE PSHUFB instruction is pretty much among the most useful instructions of all time.

Yup. It was the first variable-shuffle, with a control mask from a register instead of an immediate. Looking up a shuffle mask from a LUT of shuffle masks based on a pcmpeqb / pmovmskb result can do some crazy powerful things. @stgatilov's IPv4 dotted-quad -> int converter is one of my favourite examples of awesome SIMD tricks.
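A small illustration of the mask-LUT pattern, unrelated to the IPv4 converter mentioned above: remove every occurrence of one byte value and pack the survivors to the front. This is a scalar sketch restricted to the low 8 bytes of a vector so that the table has only 256 entries; all names are hypothetical, and the compare and shuffle steps are modeled in plain C rather than done with pcmpeqb / pmovmskb / pshufb.

```c
#include <stdint.h>

/* Hypothetical LUT: one 16-byte shuffle mask per 8-bit compare mask. */
static uint8_t pack_lut[256][16];

/* Scalar model of 128-bit PSHUFB; `out` must not alias the inputs. */
static void pshufb_lut_model(const uint8_t src[16], const uint8_t idx[16],
                             uint8_t out[16])
{
    for (int i = 0; i < 16; i++)
        out[i] = (idx[i] & 0x80) ? 0 : src[idx[i] & 0x0F];
}

/* Model of pcmpeqb + pmovmskb on the low 8 bytes: bit i is set when
   src[i] equals `needle`. */
static unsigned eq_mask8(const uint8_t src[16], uint8_t needle)
{
    unsigned m = 0;
    for (int i = 0; i < 8; i++)
        if (src[i] == needle)
            m |= 1u << i;
    return m;
}

/* For each possible compare mask, list the positions that did NOT match,
   then pad with 0x80 so the shuffle writes zeros to the remaining bytes. */
static void build_pack_lut(void)
{
    for (unsigned m = 0; m < 256; m++) {
        int n = 0;
        for (int i = 0; i < 8; i++)
            if (!(m & (1u << i)))
                pack_lut[m][n++] = (uint8_t)i;
        while (n < 16)
            pack_lut[m][n++] = 0x80;
    }
}

/* Strip `needle` from the low 8 bytes of src; returns how many bytes remain.
   Call build_pack_lut() once before using this. */
static int strip_byte8(const uint8_t src[16], uint8_t needle, uint8_t out[16])
{
    unsigned m = eq_mask8(src, needle);
    int kept = 0;
    for (int i = 0; i < 8; i++)
        kept += !((m >> i) & 1);
    pshufb_lut_model(src, pack_lut[m], out);
    return kept;
}
```

The data-dependent part is reduced to one table load and one shuffle, with no per-byte branches, which is what makes the trick so effective on real hardware.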

199

answered Oct 07 '22 06:10

Peter Cordes

Tags: x86, assembly, intel, sse, avx2
