
What are the best instruction sequences to generate vector constants on the fly?

"Best" means fewest instructions (or fewest uops, if any instructions decode to more than one uop). Machine-code size in bytes is a tie-breaker for equal insn count.

Constant-generation is by its very nature the start of a fresh dependency chain, so it's unusual for latency to matter. It's also unusual to generate constants inside a loop, so throughput and execution-port demands are also mostly irrelevant.

Generating constants instead of loading them takes more instructions (except for all-zero or all-one), so it does consume precious uop-cache space. This can be an even more limited resource than data cache.

Agner Fog's excellent Optimizing Assembly guide covers this in Section 13.4. Table 13.10 has sequences for generating vectors where every element is 0, 1, 2, 3, 4, -1, or -2, with element sizes from 8 to 64 bits. Table 13.11 has sequences for generating some floating point values (0.0, 0.5, 1.0, 1.5, 2.0, -2.0, and bitmasks for the sign bit.)

Agner Fog's sequences only use SSE2, either by design or because the guide hadn't been updated for a while.

What other constants can be generated with short non-obvious sequences of instructions? (Further extensions with different shift counts are obvious and not "interesting".) Are there better sequences for generating the constants Agner Fog does list?

How to move 128-bit immediates to XMM registers illustrates some ways to put an arbitrary 128b constant into the instruction stream, but that's usually not sensible (it doesn't save any space, and takes up lots of uop-cache space.)

Asked Jan 29 '16 by Peter Cordes

1 Answer

All-zero: pxor xmm0,xmm0 (or xorps xmm0,xmm0, one instruction-byte shorter.) There isn't much difference on modern CPUs, but on Nehalem (before xor-zero elimination), the xorps uop could only run on port 5. I think that's why compilers favour pxor-zeroing even for registers that will be used with FP instructions.

All-ones: pcmpeqw xmm0,xmm0. This is the usual starting point for generating other constants, because (like pxor) it breaks the dependency on the previous value of the register (except on old CPUs like K10 and pre-Core2 P6).

There's no advantage to the W version over the byte or dword element size versions of pcmpeq on any CPU in Agner Fog's instruction tables, but pcmpeqQ takes an extra byte, is slower on Silvermont, and requires SSE4.1.

SO doesn't really have table formatting, so I'm just going to list additions to Agner Fog's table 13.10, rather than an improved version. Sorry. Maybe if this answer becomes popular, I'll use an ascii-art table-generator, but hopefully improvements will be rolled into future versions of the guide.


The main difficulty is 8-bit vectors, because there's no PSLLB

Agner Fog's table generates vectors of 16-bit elements and uses packuswb to work around this. For example, pcmpeqw xmm0,xmm0 / psrlw xmm0,15 / psllw xmm0,1 / packuswb xmm0,xmm0 generates a vector where every byte is 2. (This pattern of shifts, with different counts, is the main way to produce most constants for wider vectors). There is a better way:

paddb xmm0,xmm0 (SSE2) works as a left-shift by one with byte granularity, so a vector of -2 bytes can be generated with only two instructions (pcmpeqw / paddb). paddw/d/q as a left-shift-by-one for other element sizes saves one byte of machine code compared to shifts, and can generally run on more ports than a shift-imm.
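As a rough sketch of that two-instruction sequence with intrinsics (pcmpeqw maps to _mm_cmpeq_epi16 and paddb to _mm_add_epi8; whether the compiler actually emits this sequence instead of loading a constant is up to the compiler):

    #include <immintrin.h>

    // Sketch: generate set1_epi8(-2) without loading a constant.
    // _mm_undefined_si128() tells the compiler we don't care what is in
    // the register before the compare-with-itself.
    static inline __m128i all_minus2_epi8(void) {
        __m128i u = _mm_undefined_si128();
        __m128i m1 = _mm_cmpeq_epi16(u, u);   // pcmpeqw: all-ones = set1(-1)
        return _mm_add_epi8(m1, m1);          // paddb: -1 + -1 = -2 in every byte
    }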

pabsb xmm1,xmm0 (SSSE3) turns a vector of all-ones (-1) into a vector of 1 bytes, and since the destination can be a different register, you still have the set1(-1) vector.

(You sometimes don't need set1(1). You can add 1 to every element by subtracting -1 with psubb instead.)

We can generate 2 bytes with pcmpeqw / paddb / pabsb. (Order of add vs. abs doesn't matter). pabs doesn't need an imm8, but only saves code bytes for other element widths vs. right shifting when both require a 3-byte VEX prefix. This only happens when the source register is xmm8-15. (vpabsb/w/d always requires a 3-byte VEX prefix for VEX.128.66.0F38.WIG, but vpsrlw dest,src,imm can otherwise use a 2-byte VEX prefix for its VEX.NDD.128.66.0F.WIG).
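That three-instruction sequence as an intrinsics sketch (pabsb is _mm_abs_epi8, SSSE3, so this needs -mssse3 or equivalent):

    #include <immintrin.h>   // needs SSSE3 for _mm_abs_epi8

    // Sketch: set1_epi8(2) via pcmpeqw / paddb / pabsb.
    static inline __m128i all_2_epi8(void) {
        __m128i u  = _mm_undefined_si128();
        __m128i m1 = _mm_cmpeq_epi16(u, u);   // pcmpeqw: set1(-1)
        __m128i m2 = _mm_add_epi8(m1, m1);    // paddb:   set1(-2)
        return _mm_abs_epi8(m2);              // pabsb:   set1(2)
    }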

We can actually save instructions in generating 4 bytes, too: pcmpeqw / pabsb / psllw xmm0, 2. All the bits that are shifted across byte boundaries by the word-shift are zero, thanks to pabsb. Obviously other shift counts can put the single set-bit at other locations, including the sign bit to generate a vector of -128 (0x80) bytes. Note that pabsb is non-destructive (the destination operand is write-only, and doesn't need to be the same as the source to get the desired behaviour). You can keep the all-ones around as a constant, or as the start of generating another constant, or as a source operand for psubb (to increment by one).
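An intrinsics sketch of that word-shift trick (again, the exact asm depends on the compiler):

    #include <immintrin.h>

    // Sketch: pcmpeqw / pabsb / psllw for set1_epi8(4) and set1_epi8(-128).
    // pabsb makes every byte 0x01, so a word-granularity shift of up to 7
    // doesn't move any set bits across byte boundaries.
    static inline __m128i all_4_epi8(void) {
        __m128i u  = _mm_undefined_si128();
        __m128i b1 = _mm_abs_epi8(_mm_cmpeq_epi16(u, u));  // every byte = 0x01
        return _mm_slli_epi16(b1, 2);                      // 0x0101 -> 0x0404
    }

    static inline __m128i all_0x80_epi8(void) {
        __m128i u  = _mm_undefined_si128();
        __m128i b1 = _mm_abs_epi8(_mm_cmpeq_epi16(u, u));  // every byte = 0x01
        return _mm_slli_epi16(b1, 7);                      // 0x0101 -> 0x8080
    }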

A vector of 0x80 bytes can also be generated (see above) from anything that saturates to -128, using packsswb. e.g. if you already have a vector of 0xFF00 for something else, just copy it and use packsswb. Constants loaded from memory that happen to saturate correctly are potential targets for this.
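For example, if a set1_epi16(0xFF00) vector is already live, reusing it might look like this sketch (hypothetical variable name):

    #include <immintrin.h>

    // Sketch: turn an existing vector of 0xFF00 words (= -256 signed) into
    // set1_epi8(-128).  packsswb saturates each signed word to the int8
    // range, so -256 becomes -128 (0x80).
    static inline __m128i bytes_0x80_from_ff00(__m128i ff00_words) {
        return _mm_packs_epi16(ff00_words, ff00_words);   // packsswb
    }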

A vector of 0x7f bytes can be generated with pcmpeqw / psrlw xmm0, 9 / packuswb xmm0,xmm0. I'm counting this as "non-obvious" because the mostly-set bit pattern didn't immediately suggest just generating the value in each word and doing the usual packuswb.

pavgb (SSE2) against a zeroed register can right-shift by one, but only if the value is even. (It does unsigned dst = (dst+src+1)>>1 for rounding, with 9-bit internal precision for the temporary.) This doesn't seem to be useful for constant-generation, though, because 0xff is odd: pxor xmm1,xmm1 / pcmpeqw xmm0,xmm0 / paddb xmm0,xmm0 / pavgb xmm0, xmm1 produces 0x7f bytes with one more insn than shift/pack. If a zeroed register is already needed for something else, though, paddb / pavgb does save one instruction byte.
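Both 0x7f recipes, sketched with intrinsics for comparison (whether the pavgb version saves anything depends on a zeroed register already being available):

    #include <immintrin.h>

    // Sketch: two ways to get set1_epi8(0x7f) from all-ones.
    static inline __m128i bytes_0x7f_shift_pack(void) {
        __m128i u  = _mm_undefined_si128();
        __m128i m1 = _mm_cmpeq_epi16(u, u);       // pcmpeqw: set1(-1)
        __m128i w  = _mm_srli_epi16(m1, 9);       // psrlw 9: every word = 0x007f
        return _mm_packus_epi16(w, w);            // packuswb: every byte = 0x7f
    }

    static inline __m128i bytes_0x7f_pavgb(__m128i zero) {  // zero = _mm_setzero_si128()
        __m128i u  = _mm_undefined_si128();
        __m128i m1 = _mm_cmpeq_epi16(u, u);       // set1(-1) = 0xff bytes
        __m128i m2 = _mm_add_epi8(m1, m1);        // paddb: 0xfe bytes
        return _mm_avg_epu8(m2, zero);            // pavgb: (0xfe + 0 + 1) >> 1 = 0x7f
    }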


I have tested these sequences. The easiest way is to throw them in a .asm, assemble/link, and run gdb on it. layout asm, display /x $xmm0.v16_int8 to dump that after every single-step, and single-step instructions (ni or si). In layout reg mode, you can do tui reg vec to switch to a display of vector regs, but it's nearly useless because you can't select which interpretation to display (you always get all of them, and can't hscroll, and the columns don't line up between registers). It's excellent for integer regs/flags, though.


Note that using these with intrinsics can be tricky. Compilers don't like to operate on uninitialized variables, so you should use _mm_undefined_si128() to tell the compiler that's what you meant. Or perhaps using _mm_set1_epi32(-1) will get your compiler to emit a pcmpeqd same,same. Without this, some compilers will xor-zero uninitialized vector variables before use, or even (MSVC) load uninitialized memory from the stack.
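A minimal sketch of that idiom (how well it works, i.e. whether you really get a single pcmpeqd same,same with no extra zeroing, varies by compiler and version):

    #include <immintrin.h>

    // Sketch: ask the compiler for an all-ones vector without a load.
    __m128i all_ones_a(void) {
        __m128i u = _mm_undefined_si128();      // "don't care" starting value
        return _mm_cmpeq_epi32(u, u);           // hopefully a single pcmpeqd
    }

    // Or just describe the value and let the compiler pick the strategy:
    __m128i all_ones_b(void) {
        return _mm_set1_epi32(-1);              // often compiles to pcmpeqd same,same
    }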


Many constants can be stored more compactly in memory by taking advantage of SSE4.1's pmovzx or pmovsx for zero or sign-extension on the fly. For example, a 128b vector of {1, 2, 3, 4} as 32bit elements could be generated with a pmovzx load from a 32bit memory location. Memory operands can micro-fuse with pmovzx, so it doesn't take any extra fused-domain uops. It does prevent using the constant directly as a memory operand, though.

C/C++ intrinsics support for using pmovz/sx as a load is terrible: there's _mm_cvtepu8_epi32 (__m128i a), but no version that takes a uint32_t * pointer operand. You can hack around it, but it's ugly and compiler optimization failure is a problem. See the linked question for details and links to the gcc bug reports.
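A sketch of the usual workaround (hypothetical helper name; as the linked question discusses, compilers don't always fold this into a single pmovzxbd with a memory operand):

    #include <immintrin.h>
    #include <stdint.h>
    #include <string.h>

    // Sketch: load 4 bytes and zero-extend them to 4 dword elements,
    // hoping for a single pmovzxbd xmm, [mem].
    static inline __m128i load_zx_u8_to_u32(const uint8_t *p) {
        uint32_t bits;
        memcpy(&bits, p, 4);                    // strict-aliasing-safe 4-byte load
        return _mm_cvtepu8_epi32(_mm_cvtsi32_si128((int)bits));
    }

    // Usage: a {1, 2, 3, 4} dword vector from only 4 bytes of static data.
    static const uint8_t kOneTwoThreeFour[4] = {1, 2, 3, 4};
    // __m128i v = load_zx_u8_to_u32(kOneTwoThreeFour);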

With 256b and (not so) soon 512b constants, the savings in memory are larger. This only matters very much if multiple useful constants can share a cache-line, though.

The FP equivalent of this is VCVTPH2PS xmm1, xmm2/m64, requiring the F16C (half precision) feature flag. (There's also a store instruction that packs single to half, but no computation at half precision. It's a memory bandwidth / cache footprint optimization only.)
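With intrinsics, that might look like the following sketch (requires F16C, e.g. -mf16c; the constant is stored as four half-precision floats = 8 bytes, and the compiler may or may not fold the load into a vcvtph2ps memory operand):

    #include <immintrin.h>

    // Sketch: store an FP constant as 4 half-precision floats (8 bytes)
    // and expand it to 4 single-precision floats on load.
    static inline __m128 load_4_halves_as_ps(const void *p) {
        __m128i halves = _mm_loadl_epi64((const __m128i *)p);  // 64-bit load
        return _mm_cvtph_ps(halves);                           // vcvtph2ps
    }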


Obviously when all elements are the same (but not suitable for generating on the fly), pshufd or AVX vbroadcastps / AVX2 vpbroadcastb/w/d/q/i128 are useful. pshufd can take a memory source operand, but it has to be 128b. movddup (SSE3) does a 64bit load, broadcast to fill a 128b register. On Intel, it doesn't need an ALU execution unit, only load port. (Similarly, AVX v[p]broadcast loads of dword size and larger are handled in the load unit, without ALU).
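For reference, broadcast loads with intrinsics (a sketch; vbroadcastss needs AVX and movddup needs SSE3):

    #include <immintrin.h>

    // Sketch: broadcast loads that keep only 4 or 8 bytes of constant data.
    static inline __m256 bcast_ss(const float *p) {
        return _mm256_broadcast_ss(p);            // vbroadcastss ymm, [mem] (AVX)
    }

    static inline __m128d bcast_sd(const double *p) {
        return _mm_loaddup_pd(p);                 // movddup xmm, [mem] (SSE3)
    }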

Broadcasts or pmovz/sx are excellent for saving executable size when you're going to load a mask into a register for repeated use in a loop. Generating multiple similar masks from one starting point can also save space, if it only takes one instruction.

See also For an SSE vector that has all the same components, generate on the fly or precompute?, which asks more about using the set1 intrinsic, and it isn't clear whether it's asking about constants or broadcasts of variables.

I also experimented some with compiler output for broadcasts.


If cache misses are a problem, take a look at your code and see if the compiler has duplicated _mm_set constants when the same function is inlined into different callers. Also watch out for constants that are used together (e.g. in functions called one after another) being scattered into different cache lines. Many scattered loads for constants are far worse than loading a lot of constants all from near each other.

pmovzx and/or broadcast loads let you pack more constants into a cache line, with very low overhead for loading them into a register. The load won't be on the critical path, so even if it takes an extra uop, it can take a free execution unit at any cycle over a long window.

clang actually does a good job of this: separate set1 constants in different functions are recognized as identical, the way identical string literals can be merged. Note that clang's asm source output appears to show each function having its own copy of the constant, but the binary disassembly shows that all those RIP-relative effective addresses reference the same location.

For 256b versions of the repeated functions, clang also uses vbroadcastsd to only require an 8B load, at the expense of an extra instruction in each function. (This is at -O3, so clearly the clang devs have realized that size matters for performance, not just for -Os.) IDK why it doesn't go down to a 4B constant with vbroadcastss, because that should be just as fast. Unfortunately, the vbroadcast loads don't simply come from part of the 16B constant the other functions used. This maybe makes sense: an AVX version of something could probably only merge some of its constants with an SSE version. It's better to leave the memory pages with SSE constants completely cold, and have the AVX version keep all its constants together. Also, it's a harder pattern-matching problem to be handled at assemble or link time (however it's done; I didn't read every directive to figure out which one enables the merging.)

gcc 5.3 also merges constants, but doesn't use broadcast-loads to compress 32B constants. Again the 16B constant doesn't overlap with the 32B constant.

Answered Sep 17 '22 by Peter Cordes