SHLD/SHRD instructions are assembly instructions for implementing multiprecision shifts.
Consider the following problem:
uint64_t array[4] = {/*something*/};
left_shift(array, 172);
right_shift(array, 172);
What is the most efficient way to implement left_shift and right_shift, two functions that shift an array of four 64-bit unsigned integers as if it were one big 256-bit unsigned integer?
Is using the SHLD/SHRD instructions the most efficient way of doing that, or are there better instructions (such as SIMD versions) on modern architectures?
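For reference, the behaviour being asked for can be written down in portable C before worrying about speed. This is only a baseline sketch, not a tuned implementation. It assumes array[0] holds the least significant 64 bits and that the shift count is below 256; neither is stated in the question.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: a[0] is the least significant limb, n < 256. */
void left_shift(uint64_t a[4], unsigned n)
{
    assert(n < 256);
    int limbs = (int)(n / 64);   /* whole 64-bit words to move */
    unsigned bits = n % 64;      /* remaining bit shift, 0..63 */
    /* Walk from the top limb down so sources are read before being overwritten. */
    for (int i = 3; i >= 0; --i) {
        uint64_t hi = (i - limbs >= 0) ? a[i - limbs] : 0;
        uint64_t lo = (i - limbs - 1 >= 0) ? a[i - limbs - 1] : 0;
        /* A shift by 64 is undefined in C, so bits == 0 is handled separately. */
        a[i] = bits ? (hi << bits) | (lo >> (64 - bits)) : hi;
    }
}

void right_shift(uint64_t a[4], unsigned n)
{
    assert(n < 256);
    int limbs = (int)(n / 64);
    unsigned bits = n % 64;
    /* Walk from the bottom limb up for the same reason. */
    for (int i = 0; i < 4; ++i) {
        uint64_t lo = (i + limbs < 4) ? a[i + limbs] : 0;
        uint64_t hi = (i + limbs + 1 < 4) ? a[i + limbs + 1] : 0;
        a[i] = bits ? (lo >> bits) | (hi << (64 - bits)) : lo;
    }
}
```

The inner expression (hi << bits) | (lo >> (64 - bits)) is the double-precision shift that SHLD/SHRD perform in a single instruction.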
In this answer I'm only going to talk about x64.
x86 has been outdated for 15 years now; if you're coding in 2016, it hardly makes sense to be stuck in 2000.
All times are according to Agner Fog's instruction tables.
Intel Skylake example timings*
The shld/shrd instructions are rather slow on x64.
Even on Intel Skylake they have a latency of 4 cycles and use 4 uops, which means they tie up a lot of execution units; on older processors they're even slower.
I'm going to assume you want to shift by a variable amount, which means a
SHLD RAX,RDX,cl 4 uops, 4 cycle latency. -> 1/16 per bit
Using 2 shifts plus a mask and an OR, you end up doing this slower, not faster.
@Init:
MOV R15,-1
SHL R15,cl
NOT R15 //mask of the low cl bits, for later use.
@Work:
SHL RAX,cl 3 uops, 2 cycle latency
ROL RDX,cl 3 uops, 2 cycle latency
AND RDX,R15 1 uops, 0.25 latency
OR RAX,RDX 1 uops, 0.25 latency
//Still needs unrolling to achieve least amount of slowness.
Note that this only shifts 64 bits because RDX is not affected.
So you're trying to beat 4 cycles per 64 bits.
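To make explicit what has to be reproduced, here is a small C model of SHLD on 64-bit operands next to the rotate-and-mask alternative. The function names are made up for illustration. The second function keeps only the low count bits of the rotated source, which is what the mask has to do for the two results to match.

```c
#include <assert.h>
#include <stdint.h>

/* Models SHLD dst, src, cl on 64-bit operands: dst is shifted left and the
   vacated low bits are filled from the top of src. src itself is unchanged. */
uint64_t shld64(uint64_t dst, uint64_t src, unsigned count)
{
    count &= 63;                      /* the count is taken modulo 64 for 64-bit operands */
    if (count == 0) return dst;       /* avoids the undefined shift by 64 below */
    return (dst << count) | (src >> (64 - count));
}

/* Same result built from a shift, a rotate, a mask and an OR. */
uint64_t shld64_rotate_mask(uint64_t dst, uint64_t src, unsigned count)
{
    count &= 63;
    if (count == 0) return dst;
    uint64_t low_mask = ~(UINT64_MAX << count);                 /* low `count` bits set */
    uint64_t rotated  = (src << count) | (src >> (64 - count)); /* ROL src, count */
    return (dst << count) | (rotated & low_mask);
}
```

Both functions return a single 64-bit word, which is the point of the note above: one SHLD only produces 64 bits of the result.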
//4*64 bits parallel shift.
//Shifts in zeros.
VPSLLVQ YMM2, YMM2, YMM3 1uop, 0.5 cycle latency.
However, if you want it to do exactly what SHLD does, you'll need an extra VPSRLVQ and an OR to combine the two results.
VPSLLVQ YMM1, YMM2, YMM3 1uop, 0.5 cycle latency.
VPSRLVQ YMM5, YMM2, YMM4 1uop, 0.5 cycle latency.
VPOR YMM1, YMM1, YMM5 1uop, 0.33 cycle latency.
You'll need to interleave 4 sets of these costing you (3*4)+2=14 YMM registers.
Doing so, I doubt you'll profit from the low 0.33 latency of VPOR, so I'll assume a 0.5 latency instead.
That makes 3uops, 1.5 cycle latency for 256 bits = 1/171 per bit = 0.37 cycle per QWord = 10x faster, not bad.
If you are able to get 1.33 cycles per 256 bits, that is 1/192 per bit = 0.33 cycle per QWord = 12x faster.
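The shift-left, shift-right, OR combination can be modelled lane by lane in plain C to check the logic. This is a scalar sketch of what the three vector instructions compute, not intrinsics code; the function names and the use of separate dst/src arrays are my own assumptions. It relies on the fact that VPSLLVQ and VPSRLVQ produce 0 in a lane whose count is above 63, so a count of 0 needs no special case.

```c
#include <assert.h>
#include <stdint.h>

/* One lane of VPSLLVQ / VPSRLVQ: a count above 63 yields 0. */
static uint64_t lane_sllv(uint64_t x, uint64_t count) { return count > 63 ? 0 : x << count; }
static uint64_t lane_srlv(uint64_t x, uint64_t count) { return count > 63 ? 0 : x >> count; }

/* Four SHLD-style results at once, one per 64-bit lane. */
void shld_4lanes(uint64_t out[4], const uint64_t dst[4],
                 const uint64_t src[4], uint64_t count)
{
    assert(count < 64);
    for (int i = 0; i < 4; ++i) {
        uint64_t left  = lane_sllv(dst[i], count);       /* VPSLLVQ */
        uint64_t right = lane_srlv(src[i], 64 - count);  /* VPSRLVQ */
        out[i] = left | right;                           /* VPOR */
    }
}
```

For a 256-bit left shift by fewer than 64 bits, dst would be the four limbs and src the same limbs moved up by one position with a zero in lane 0. Getting the data into that layout costs extra instructions that are not counted in the timings above.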
'It’s the Memory, Stupid!'
Obviously I've not added in loop overhead and load/stores to/from memory.
The loop overhead is tiny given proper alignment of jump targets, but the memory
access will easily be the biggest slowdown.
A single cache miss to main memory on Skylake can cost you more than 250 cycles (see note 1).
It is in clever management of memory that the major gains will be made.
The possible 12x speed-up from 256-bit AVX2 is small potatoes in comparison.
I'm not counting the setup of the shift counter in CL/(YMM3/YMM4) because I'm assuming you'll reuse that value over many iterations.
You're not going to beat that with AVX512 instructions, because consumer-grade CPUs with AVX512 instructions are not yet available.
The only current processor that supports it is Knights Landing.
*) All these timings are best case values, and should be taken as indications, not as hard values.
1) Cost of a cache miss on Skylake: 42 cycles + 52 ns = 42 + (52 * 4.6 GHz) = 281 cycles.