
Is there an efficient way to get the first non-zero element in an SIMD register using SIMD intrinsics?

As the title reads, if a 256-bit SIMD register is:

0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |

How can I efficiently get the index of the first non-zero element (i.e. the index 2 of the first 1)? The most straightforward way is to store the vector to memory and check element by element, but that may cost too much. Are there any clever ideas for doing this?

asked Oct 14 '16 by MarZzz



2 Answers

  • PCMPEQB/W/D/Q against an all-zero register to get a vector whose elements are all-ones for the zero elements and all-zero for the non-zero elements.
  • PMOVMSKB to turn the vector of all-ones or all-zero into an integer bitmask. (Or movmskps or pd to get 1 bit per dword or qword, instead of per byte, if that makes your bit-scan -> index calculation more efficient, like if you want an element offset instead of a byte offset.)
  • invert that (C ~ operator, asm NOT instruction) to get 1s in the bitmask for elements that were non-zero
  • TZCNT or BSF that integer to find the first (lowest) set bit. Beware of BSF's behaviour if its input is all-zero. But fortunately that's not a problem when the input is an int ~bitmask - the high 16 zero bits become 1s. (An AVX2 version of this with vpmovmskb ymm that filled a whole uint32_t with possibly-1 bits could use ~(uint64_t)bitmask, or just use tzcnt since AVX2 CPUs also have BMI1.)

For example with intrinsics:

int first_nonzero_byte(__m128i v){
    //__m128i v = _mm_loadu_si128((const __m128i*)p);  // for a pointer arg
    __m128i vcmp = _mm_cmpeq_epi8(v, _mm_setzero_si128());
    unsigned bitmask = _mm_movemask_epi8(vcmp);
#ifdef __GNUC__
    return __builtin_ctz(~bitmask);
#else
    return _tzcnt_u32( ~bitmask );
#endif
    // returns 16 if v was all zero so ~bitmask is 0xFFFF0000
}

Compiles on https://godbolt.org/z/Y8vYbsW69 to

# GCC11.2 -O3 -msse4.1
        movdqa  xmm1, xmm0      # missed optimization, should zero XMM1 instead
        pxor    xmm0, xmm0
        pcmpeqb xmm0, xmm1
        pmovmskb        eax, xmm0
        not     eax
        rep bsf eax, eax        # tzcnt on new CPUs, BSF on old
        ret

In GNU C where _tzcnt_u32 won't compile without -march=haswell or something, we use __builtin_ctz. As I said, ~bitmask is guaranteed to be non-zero. tzcnt is encoded as rep bsf; old CPUs will execute it as bsf, producing the same result for non-zero inputs. New CPUs will execute it as tzcnt, which is more efficient on AMD (2 uops instead of 7). Intel executes either as single-uop. GCC uses rep bsf aka tzcnt if you don't tell it a specific CPU to tune for.

Here's a related function, like the one shown in JATothrim's answer, using only 4 single-uop instructions (actually 2 uops for tzcnt on AMD) instead of 8 instructions including a pblendvb (2 uops on Intel). The shuffle/horizontal-reduction idea in that answer could possibly be useful if you want the element index as a shuffle-control vector for vpermilps, but it seems sub-optimal vs. this when you actually want a scalar int.

int equal_first_dword_bitscan(__m128i x, __m128i y)
{
    __m128i vcmp = _mm_cmpeq_epi32(x,y);
    unsigned bitmask = _mm_movemask_ps(_mm_castsi128_ps(vcmp));
    bitmask |= 1<<4;    // return 4 if the low 4 bits are all 0
#ifdef __GNUC__
    return __builtin_ctz(bitmask);
#else
    return  _tzcnt_u32( bitmask );  // runs as BSF on old CPUs, don't skip the OR
#endif
}

MSVC doesn't have __builtin_ctz, but will compile _tzcnt_u32 even if you haven't told it the target CPU supports BMI1. If you're definitely only running on CPUs with BMI1, you can leave out the bitmask |= 1<<4; so it will return 32 for not-found.

If you use trailing-zero count in multiple functions, it's best to wrap that ifdef stuff in a helper function instead of repeating it at each use-case.
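
For example, a minimal helper along those lines (the name tzcnt_nonzero is my own):

static inline int tzcnt_nonzero(unsigned x)
{
    // Caller must guarantee x != 0: __builtin_ctz is undefined for 0,
    // and on CPUs without BMI1 _tzcnt_u32 runs as BSF, which leaves the
    // destination unmodified for a zero input.
#ifdef __GNUC__
    return __builtin_ctz(x);
#else
    return _tzcnt_u32(x);
#endif
}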


If there's only one possible non-zero value (like 1), PCMPEQB against a vector of that value so you don't need to invert the mask later.
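
For example, a sketch of that variant (assuming GNU C for __builtin_ctz, as above):

int first_one_byte(__m128i v)
{
    __m128i vcmp = _mm_cmpeq_epi8(v, _mm_set1_epi8(1));  // match the 1s directly
    unsigned bitmask = _mm_movemask_epi8(vcmp);
    return __builtin_ctz(bitmask | 1<<16);  // no NOT needed; returns 16 if not found
}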

If that's the case, consider storing your data in a bitmap in the first place, to decrease your cache footprint by a factor of 8. Then you just TZCNT 64-bit chunks of the array.
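
A sketch of that bitmap scan (assuming GNU C for __builtin_ctzll and a uint64_t bitmap; the names are my own):

ptrdiff_t first_set_bit(const uint64_t *bitmap, size_t nwords)
{
    for (size_t i = 0; i < nwords; i++)
        if (bitmap[i])  // TZCNT within the first non-zero 64-bit chunk
            return i * 64 + __builtin_ctzll(bitmap[i]);
    return -1;  // no set bit anywhere
}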

Or for a larger array of data, search for the first non-zero vector with SIMD, then TZCNT the first non-zero element of it, if you expect there to be multiple qwords of zeros before the first set bit. Like memcmp does for finding the mismatching byte position.
See Efficiently find least significant set bit in a large array? and How to find the first nonzero in an array efficiently?
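
A sketch of that two-phase search over bytes (assuming <immintrin.h>, GNU C, and n a multiple of 16; the names are my own):

ptrdiff_t first_nonzero_byte_in_array(const uint8_t *p, size_t n)
{
    for (size_t i = 0; i < n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i*)(p + i));
        // bit set in zmask for each byte of v that is zero
        unsigned zmask = _mm_movemask_epi8(_mm_cmpeq_epi8(v, _mm_setzero_si128()));
        if (zmask != 0xFFFF)                   // some byte was non-zero
            return i + __builtin_ctz(~zmask);  // ~zmask: 1s at non-zero bytes
    }
    return -1;  // whole array was zero
}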


BTW, the asm instruction ref manual lists the relevant C intrinsics at the bottom of each entry, and you can search Intel's intrinsics finder by asm mnemonic. (See the x86 tag wiki for links).

answered Oct 11 '22 by Peter Cordes

I have been writing a bunch of "get index of X" SIMD algorithms lately. So far the most generic way to extract an index out of e.g. a compare mask has been via a horizontal minimum of indices.

Here is (unsigned) integer horizontal minimum:

int horizontal_min(__m128i x) {
    // swap the 64-bit halves: lane0 = min(e0,e2), lane1 = min(e1,e3)
    x = _mm_min_epu32(x, _mm_shuffle_epi32(x, 0b01001110));
    // bring lane 1 down to lane 0: lane0 = min of all four elements
    x = _mm_min_epu32(x, _mm_shuffle_epi32(x, 0b11100001));
    return _mm_extract_epi32(x,0);
}

Now do following:

int equal_first(__m128i x, __m128i y) {
    // _mm_set_epi32 takes args highest element first, so lane i holds i
    const __m128i index = _mm_set_epi32(3,2,1,0);
    // Compute mask
    __m128i mask = _mm_cmpeq_epi32(x,y);
    // Select indices: lane index where equal, -1 (unsigned max) elsewhere.
    mask = _mm_blendv_epi8(_mm_set1_epi32(-1), index, mask);
    // mask = index | (~mask);
    // pick the smallest index.
    return horizontal_min(mask);
}

The advantage of this code is that you don't need any bit-scanning instructions and everything stays in the SIMD unit.

Tip: It becomes very efficient with 16-bit indices if you use the phminposuw instruction (_mm_minpos_epu16) to compute the minimum index.
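
A sketch of that tip for 8x 16-bit elements (SSE4.1; the function name is my own):

int equal_first_epi16(__m128i x, __m128i y) {
    const __m128i index = _mm_set_epi16(7,6,5,4,3,2,1,0);  // lane i holds i
    __m128i mask = _mm_cmpeq_epi16(x, y);
    // lane index where equal, 0xFFFF (unsigned max) elsewhere
    __m128i idx = _mm_blendv_epi8(_mm_set1_epi16(-1), index, mask);
    // phminposuw: horizontal min of 8 unsigned words, min value in element 0
    __m128i min = _mm_minpos_epu16(idx);
    return _mm_extract_epi16(min, 0);  // 0xFFFF if no lanes were equal
}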

EDIT: Peter's analysis pointed out that my solution is slower unless you need the result in a SIMD register.

Another use case is reduction loops where you want the index of some element in an array. In the loop, you accumulate e.g. the min/max element's indices in a SIMD register. The now-unordered indices may point anywhere in the source array, so afterwards you have to use horizontal_min() to tell where the min/max element was.
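
A sketch of that pattern, an argmin over an int array (assuming SSE4.1, n a multiple of 4, and n > 0; the function name is my own):

int argmin_epi32(const int *a, size_t n) {
    __m128i vmin = _mm_loadu_si128((const __m128i*)a);
    __m128i vidx = _mm_set_epi32(3,2,1,0);   // lane i starts holding index i
    __m128i idx  = vidx;
    for (size_t i = 4; i < n; i += 4) {
        idx = _mm_add_epi32(idx, _mm_set1_epi32(4));
        __m128i v  = _mm_loadu_si128((const __m128i*)(a + i));
        __m128i lt = _mm_cmplt_epi32(v, vmin);
        vmin = _mm_min_epi32(v, vmin);           // per-lane running min
        vidx = _mm_blendv_epi8(vidx, idx, lt);   // take the new index on a new min
    }
    // Broadcast the overall (signed) min, then keep only the indices of
    // lanes that hold it; all other lanes become -1 = unsigned max.
    __m128i lo = _mm_min_epi32(vmin, _mm_shuffle_epi32(vmin, 0b01001110));
    lo = _mm_min_epi32(lo, _mm_shuffle_epi32(lo, 0b11100001));
    __m128i eq = _mm_cmpeq_epi32(vmin, _mm_shuffle_epi32(lo, 0));
    vidx = _mm_blendv_epi8(_mm_set1_epi32(-1), vidx, eq);
    return horizontal_min(vidx);   // unsigned min picks the lowest valid index
}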

answered Oct 11 '22 by JATothrim