With Intel compiler intrinsics, given a 128-bit register packing 8 16-bit elements, how do I (cheaply) access arbitrary elements from within the register, for subsequent use of _mm_cvtepi8_epi64 (sign-extend two 8-bit elements, packed in the lower 16 bits of the register, to two 64-bit elements)?
I'll explain why I ask: the input is a buffer of doubles plus a buffer of mask bytes, and I want to mask each double before summing. Each mask byte holds 0x0 or 0xff, which should expand to the 64-bit masks 0x0 and 0xffff ffff ffff ffff, respectively.

Note: The values 0x0 and 0xff of the input buffer may be changed to whatever is most helpful, provided that the effect of masking before the sum remains.

As may be apparent from my question, my current plan is as follows, streaming across the input buffers:
Thanks, Asaf
Each byte is the mask for an entire double, so PMOVSXBQ does exactly what we need: load two bytes from an m16 pointer, and sign-extend them to the two 64-bit (qword) halves of an xmm register.
# UNTESTED CODE
# (loop setup stuff)
# RSI: double pointer
# RDI: mask pointer
# RCX: loop counter = mask byte-count
add rdi, rcx
lea rsi, [rsi + rcx*8] ; sizeof(double) = 8
neg rcx ; point to the end and count up
XORPS xmm0, xmm0 ; clear accumulator
; for real use: use multiple accumulators
; to hide ADDPD latency
ALIGN 16
.loop:
PMOVSXBQ XMM1, [RDI + RCX]
ANDPD XMM1, [RSI + RCX * 8]
ADDPD XMM0, XMM1
add RCX, 2 ; 2 bytes / doubles per iter
jl .loop
MOVHLPS XMM1, XMM0 ; combine the two parallel sums
ADDPD XMM0, XMM1
ret
For real use, use multiple accumulators. Also see Micro fusion and addressing modes re: indexed addressing modes.
Writing this with intrinsics should be easy. As others have pointed out, just use dereferenced pointers as args to the intrinsics.
To answer the other part of your question, about how to shift data around to line it up for PMOVSX:

On Sandybridge and later, using PMOVSXBQ from RAM is probably good. On earlier CPUs that can't handle two loads per cycle, loading 16B of mask data at a time, and shifting it 2 bytes at a time with PSRLDQ xmm1, 2, will put 2 bytes of mask data in the low 2 bytes of the register. Or maybe PUNPCKHQDQ, or PSHUFD to get two dependency chains going by moving the high 64 to the low 64 of another reg. You'd have to check which port is used by which instruction (shift vs. shuffle/extract), and see which conflicts less with PMOVSX and ADDPD.
punpck and pshufd both use p1/p5 on SnB, and so does pmovsx. addpd can only run on p1. andpd can only run on p5. Hmm, maybe PAND would be better, since it can run on p0 (and p1/p5). Otherwise nothing in the loop will be using execution port 0. If there's a latency penalty for moving data from the integer to the FP domain, it's unavoidable if we use PMOVSX, as that will get the mask data in the int domain. Better to use more accumulators to make the loop longer than the longest dependency chain. But keep it under 28 uops or so to fit in the loop buffer, to make sure 4 uops can issue per cycle.
And more about optimizing the whole thing: aligning the loop isn't really needed, since on Nehalem and later it will fit in the loop buffer.
You should unroll the loop by 2 or 4, because pre-Haswell Intel CPUs don't have enough execution units to handle all 4 (fused) uops in a single cycle. (3 vector and one fused add/jl. The two loads fuse with the vector uops they're part of.) Sandybridge and later can execute both loads every cycle, so one iteration per cycle is doable, except for loop overhead.
Oh, ADDPD has a latency of 3 cycles. So you need to unroll and use multiple accumulators to avoid the loop-carried dependency chain being the bottleneck. Probably unroll by 4, and then sum up the 4 accumulators at the end. You'll have to do that in the source code even with intrinsics, because unrolling that way changes the order of operations for the FP math, so the compiler might not be willing to do it on its own.

So each unrolled-by-4 loop would take 4 clock cycles, plus 1 uop for the loop overhead. On Nehalem, where you have a tiny loop buffer but no uop cache, unrolling might mean you have to start caring about decoder throughput. On pre-Sandybridge, though, one load per clock will probably be the bottleneck anyway.
For decoder throughput, you can probably use ANDPS instead of ANDPD, which takes one less byte to encode. IDK if that would help.
Widening this to 256b ymm registers would require AVX2 for the most straightforward implementation (VPMOVSXBQ ymm). You might get a speedup on AVX-only by doing two VPMOVSXBQ xmm and combining them with VINSERTF128 or something.
This is rather a tangent to the question itself, more filling in some information from the comments, because the comment section itself is too small to hold it (sic!):
At least gcc can deal with the following code:
#include <smmintrin.h>
extern int fumble(__m128i x);
int main(int argc, char **argv)
{
__m128i foo;
__m128i* bar = (__m128i*)argv;
foo = _mm_cvtepi8_epi64(*bar);
return fumble(foo);
}
It turns this into the following assembly:
Disassembly of section .text.startup:

0000000000000000 <main>:
   0:   66 0f 38 22 06          pmovsxbq (%rsi),%xmm0
   5:   e9 XX XX XX XX          jmpq .....
This means that the intrinsics don't need to come in memory-argument form - the compiler handles dereferencing a mem argument transparently and uses the corresponding mem-operand instruction if possible. ICC does the same. I do not have a Windows machine / Visual C++ around to test whether MSVC does so as well, but I'd expect it to.
Have you looked at _mm_extract_epi16 (PEXTRW) and _mm_insert_epi16 (PINSRW) ?