Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Use C# Vector<T> SIMD to find index of matching element

Using C#'s Vector<T>, how can we most efficiently vectorize the operation of finding the index of a particular element in a set?

As constraints, the set will always be a Span<T> of an integer primitive, and it will contain at most 1 matching element.

I have come up with a solution that seems alright, but I'm curious if we can do better. Here is the approach:

  1. Create a Vector<T> consisting only of the target element, in each slot.
  2. Use Vector.Equals() between the input set vector and the vector from the previous step, to get a mask containing a 1 in the single matching slot (or only 0s if there is no match).
  3. Using a pre-initialized vector containing 1-based indexes (1, 2, 3, 4, ...), call Vector.Dot() between that vector and the mask from the previous step. Each index will be multiplied by 0, except the potential matching index, which will be multiplied by 1. What we get back is the sum of those multiplications, which is either 0, or the 1-based index of the matching element.
  4. If the result was 0, return -1 for no match. Otherwise, subtract one from the result to make it 0-based, and return that.

        // One-time initialized vector containing { 1, 2, 3, 4, ... }
        Vector<ushort> indexes = MemoryMarshal.Cast<ushort, Vector<ushort>>(Enumerable.Range(1, Vector<ushort>.Count).Select(index => (ushort)index).ToArray())[0];
    
        // The input set and the element to search for
        Span<ushort> set = stackalloc ushort[]{ 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25 };
        ushort element = 22;
    
        // Interpret input set as a sequence of vectors (set is assumed to have length power of two for brevity)
        var setVectors = MemoryMarshal.Cast<ushort, Vector<ushort>>(set);
    
        // Create a vector that contains the target element in each slot
        var elementVector = new Vector<ushort>(element);
    
        // Loop per vector rather than per element
        foreach (var vector in setVectors)
        {
            // Get a mask that has a 1 in the single matching slot, or only 0s
            var mask = Vector.Equals(vector, elementVector);
    
            // Get the dot product of the mask and the indexes
            // This will multiple each index by 0, or by 1 if it is the matching one, and return their sum, i.e. the matching index or 0
            // Note that the indexes are deliberately 1-based, to distinguished from 0 (no match)
            var index = Vector.Dot(indexes, mask);
    
            // Either return 0 for no match, or reduce the index by 1 to get the 0-based index
            return index == 0 ? -1 : index - 1;
        }
    
like image 343
Timo Avatar asked Oct 15 '22 13:10

Timo


2 Answers

As I can see the simple Span<char>.IndexOf already is using Intrinsics for searching a simple value. You don't even need to cast to char to use it, since MemoryExtensions.IndexOf only cares about size and Unsafe.SizeOf<ushort>() == sizeof(char) !

Also in JsonReaderHelper.IndexOfOrLessThan you will find a more complex Vectorization example for searching. It is using byte searching, but I am sure that you can adapt it to your needs if Span<ushort>.IndexOf is not suitable.

like image 185
Panos Theof Avatar answered Nov 05 '22 08:11

Panos Theof


The x86 asm you want the compiler to generate is compare-for-equal (pcmpeqb), pmovmskb or movmskps (vector to bitmask with 1-byte or 4-byte elements) and then if the mask is non-zero, bit-scan for first set bit (bsf or tzcnt).

That'll be more efficient than an integer dot product!!

You already have the compare-for-equal, and I think I've seen other C# Q&As with an intrinsic for vector->bitmap. If someone wants to edit this answer or post their own with C# that compiles / JITs to this asm, please do so. I don't know C#, I'm just here for the x86 SIMD.

like image 21
Peter Cordes Avatar answered Nov 05 '22 08:11

Peter Cordes