With Intel compiler intrinsics, given a 128-bit register packing 8 16-bit elements, how do I (cheaply) access arbitrary elements from within the register, for subsequent use of _mm_cvtepi8_epi64 (sign-extend two 8-bit elements, packed in the lower 16 bits of the register, to two 64-bit elements)?
I'll explain why I ask: the input is a buffer of doubles plus a buffer of mask bytes, and I want to mask each double before summing. Each mask byte holds 0x0 or 0xff, which should expand to the 64-bit masks 0x0 and 0xffff ffff ffff ffff, respectively.

Note: The values 0x0 and 0xff of the input buffer may be changed to whatever is most helpful, provided that the effect of masking before the sum remains.

As may be apparent from my question, my current plan is as follows, streaming across the input buffers:
Thanks, Asaf
Each byte is the mask for an entire double, so PMOVSXBQ does exactly what we need: load two bytes from an m16 pointer, and sign-extend them to the two 64-bit (qword) halves of an xmm register.
# UNTESTED CODE
# (loop setup stuff)
# RSI: double pointer
# RDI: mask pointer
# RCX: loop counter = mask byte-count
add rdi, rcx
lea rsi, [rsi + rcx*8] ; sizeof(double) = 8
neg rcx ; point to the end and count up
XORPS xmm0, xmm0 ; clear accumulator
; for real use: use multiple accumulators
; to hide ADDPD latency
ALIGN 16
.loop:
PMOVSXBQ XMM1, [RDI + RCX]
ANDPD XMM1, [RSI + RCX * 8]
ADDPD XMM0, XMM1
add RCX, 2 ; 2 bytes / doubles per iter
jl .loop
MOVHLPS XMM1, XMM0 ; combine the two parallel sums
ADDPD XMM0, XMM1
ret
For real use, use multiple accumulators. Also see Micro fusion and addressing modes re: indexed addressing modes.
Writing this with intrinsics should be easy. As others have pointed out, just use dereferenced pointers as args to the intrinsics.
To answer the other part of your question, about how to shift data around to line it up for PMOVSX:

On Sandybridge and later, using PMOVSXBQ from RAM is probably good. On earlier CPUs that can't handle two loads per cycle, loading 16B of mask data at a time, and shifting it 2 bytes at a time with PSRLDQ xmm1, 2, will put 2 bytes of mask data in the low 2 bytes of the register. Or maybe PUNPCKHQDQ, or PSHUFD to get two dependency chains going by moving the high 64 to the low 64 of another reg. You'd have to check which port is used by which instruction (shift vs. shuffle/extract), and see which conflicts less with PMOVSX and ADDPD.
punpck and pshufd both use p1/p5 on SnB, and so does pmovsx. addpd can only run on p1. andpd can only run on p5. Hmm, maybe PAND would be better, since it can run on p0 (and p1/p5). Otherwise nothing in the loop will be using execution port 0. If there's a latency penalty for moving data from the integer to the FP domain, it's unavoidable if we use PMOVSX, as that will get the mask data in the int domain. Better to use more accumulators to make the loop longer than the longest dependency chain. But keep it under 28 uops or so to fit in the loop buffer, to make sure 4 uops can issue per cycle.
And more about optimizing the whole thing: aligning the loop isn't really needed, since on Nehalem and later it will fit in the loop buffer.
You should unroll the loop by 2 or 4, because pre-Haswell Intel CPUs don't have enough execution units to handle all 4 (fused) uops in a single cycle. (3 vector and one fused add/jl. The two loads fuse with the vector uops they're part of.) Sandybridge and later can execute both loads every cycle, so one iteration per cycle is doable, except for loop overhead.
Oh, ADDPD has a latency of 3 cycles. So you need to unroll and use multiple accumulators to avoid the loop-carried dependency chain being the bottleneck. Probably unroll by 4, and then sum up the 4 accumulators at the end. You'll have to do that in the source code even with intrinsics, because unrolling that way changes the order of operations for the FP math, so the compiler might not be willing to do it on its own.

So each unrolled-by-4 loop would take 4 clock cycles, plus 1 uop for the loop overhead. On Nehalem, where you have a tiny loop buffer but no uop cache, unrolling might mean you have to start caring about decoder throughput. On pre-Sandybridge, though, one load per clock will probably be the bottleneck anyway.
For decoder throughput, you can probably use ANDPS instead of ANDPD, which takes one less byte to encode. IDK if that would help.
Widening this to 256b ymm registers would require AVX2 for the most straightforward implementation (VPMOVSXBQ ymm). You might get a speedup on AVX-only by doing two VPMOVSXBQ xmm and combining them with VINSERTF128 or something.
This is rather a tangent to the question itself, more filling in some information from the comments, because the comment section itself is too small to hold it (sic!):
At least gcc can deal with the following code:
#include <smmintrin.h>
extern int fumble(__m128i x);
int main(int argc, char **argv)
{
__m128i foo;
__m128i* bar = (__m128i*)argv;
foo = _mm_cvtepi8_epi64(*bar);
return fumble(foo);
}
It turns this into the following assembly:
Disassembly of section .text.startup:

0000000000000000 <main>:
   0:   66 0f 38 22 06          pmovsxbq (%rsi),%xmm0
   5:   e9 XX XX XX XX          jmpq .....
This means that the intrinsics don't need to come in memory-argument form - the compiler handles dereferencing a mem argument transparently and uses the corresponding mem-operand instruction if possible. ICC does the same. I do not have a Windows machine / Visual C++ around to test whether MSVC does so as well, but I'd expect it to.
Have you looked at _mm_extract_epi16 (PEXTRW) and _mm_insert_epi16 (PINSRW) ?