Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

SIMD string to unsigned int parsing in C# performance improvement

I've implemented a method for parsing an unsigned integer string of length <= 8 using SIMD intrinsics available in .NET as follows:

public unsafe static uint ParseUint(string text)
{
  fixed (char* c = text)
  {
    var parsed = Sse3.LoadDquVector128((byte*) c);
    var shift = (8 - text.Length) * 2;
    var shifted = Sse2.ShiftLeftLogical128BitLane(parsed, 
      (byte) (shift));

    Vector128<byte> digit0 = Vector128.Create((byte) '0');
    var reduced = Sse2.SubtractSaturate(shifted, digit0);

    var shortMult = Vector128.Create(10, 1, 10, 1, 10, 1, 10, 1);
    var collapsed2 = Sse2.MultiplyAddAdjacent(reduced.As<byte, short>(), shortMult);

    var repack = Sse41.PackUnsignedSaturate(collapsed2, collapsed2);
    var intMult = Vector128.Create((short)0, 0, 0, 0, 100, 1, 100, 1);
    var collapsed3 = Sse2.MultiplyAddAdjacent(repack.As<ushort,short>(), intMult);

    var e1 = collapsed3.GetElement(2);
    var e2 = collapsed3.GetElement(3);
    return (uint) (e1 * 10000 + e2);
  }
}

Sadly, a comparison with a baseline uint.Parse() gives the following, rather unimpressive, result:

Method Mean Error StdDev
Baseline 15.157 ns 0.0325 ns 0.0304 ns
ParseSimd 3.269 ns 0.0115 ns 0.0102 ns

What are some of the ways the above code can be improved? My particular areas of concern are:

  • The way a bit shift of the SIMD register happens with a calculation involving text.Length
  • ~~The unpacking of UTF-16 data using a MultiplyAddAdjacent involving a vector of 0s and 1~~
  • The way elements are extracted using GetElement() -- maybe there's some ToScalar() call that can happen somwehere?
like image 819
Dmitri Nesteruk Avatar asked Feb 25 '21 15:02

Dmitri Nesteruk


1 Answers

I've made some optimizations.

public unsafe uint ParseUint2(string text)
{
    fixed (char* c = text)
    {
        Vector128<ushort> raw = Sse3.LoadDquVector128((ushort*)c);
        raw = Sse2.ShiftLeftLogical128BitLane(raw, (byte)(8 - text.Length << 1));
        Vector128<ushort> digit0 = Vector128.Create('0');
        raw = Sse2.SubtractSaturate(raw, digit0);
        Vector128<short> mul0 = Vector128.Create(10, 1, 10, 1, 10, 1, 10, 1);
        Vector128<int> res = Sse2.MultiplyAddAdjacent(raw.AsInt16(), mul0);
        Vector128<int> mul1 = Vector128.Create(1000000, 10000, 100, 1);
        res = Sse41.MultiplyLow(res, mul1);
        res = Ssse3.HorizontalAdd(res, res);
        res = Ssse3.HorizontalAdd(res, res);
        return (uint)res.GetElement(0);
    }
}

Reduced amount of type conversions and final calculations were made with vphaddd. As result it's by ~10% faster.

But...imm8 must be a compile-time constant. It means you can't use a variable where imm8 is argument. Otherwise JIT compiler won't produce the intrinsic instruction for the operation. It will make an external method call at this place (maybe some workaround is there). Thanks @PeterCordes for help.

This monster isn't significantly but faster than above one, regardless of text.Length.

public unsafe uint ParseUint3(string text)
{
    fixed (char* c = text)
    {
        Vector128<ushort> raw = Sse3.LoadDquVector128((ushort*)c);
        switch (text.Length)
        {
            case 0: raw = Vector128<ushort>.Zero; break;
            case 1: raw = Sse2.ShiftLeftLogical128BitLane(raw, 14); break;
            case 2: raw = Sse2.ShiftLeftLogical128BitLane(raw, 12); break;
            case 3: raw = Sse2.ShiftLeftLogical128BitLane(raw, 10); break;
            case 4: raw = Sse2.ShiftLeftLogical128BitLane(raw, 8); break;
            case 5: raw = Sse2.ShiftLeftLogical128BitLane(raw, 6); break;
            case 6: raw = Sse2.ShiftLeftLogical128BitLane(raw, 4); break;
            case 7: raw = Sse2.ShiftLeftLogical128BitLane(raw, 2); break;
        };
        Vector128<ushort> digit0 = Vector128.Create('0');
        raw = Sse2.SubtractSaturate(raw, digit0);
        Vector128<short> mul0 = Vector128.Create(10, 1, 10, 1, 10, 1, 10, 1);
        Vector128<int> res = Sse2.MultiplyAddAdjacent(raw.AsInt16(), mul0);
        Vector128<int> mul1 = Vector128.Create(1000000, 10000, 100, 1);
        res = Sse41.MultiplyLow(res, mul1);
        res = Ssse3.HorizontalAdd(res, res);
        res = Ssse3.HorizontalAdd(res, res);
        return (uint)res.GetElement(0);
    }
}

Again, @PeterCordes doesn't allow me to write a slow code. The following version got 2 improvements. Now string loaded already shifted, and then subtracted to the shifted mask by the same offset. This avoids the slow fallback for ShiftLeftLogical128BitLane with a variable count.
The second improvement is replacing vphaddd with pshufd + paddd.

// Note that this loads up to 14 bytes before the data part of the string.  (Or 16 for an empty string)
// This might or might not make it possible to read from an unmapped page and fault, beware.
public unsafe uint ParseUint4(string text)
{
    const string mask = "\xffff\xffff\xffff\xffff\xffff\xffff\xffff\xffff00000000";
    fixed (char* c = text, m = mask)
    {
        Vector128<ushort> raw = Sse3.LoadDquVector128((ushort*)c - 8 + text.Length);
        Vector128<ushort> mask0 = Sse3.LoadDquVector128((ushort*)m + text.Length);
        raw = Sse2.SubtractSaturate(raw, mask0);
        Vector128<short> mul0 = Vector128.Create(10, 1, 10, 1, 10, 1, 10, 1);
        Vector128<int> res = Sse2.MultiplyAddAdjacent(raw.AsInt16(), mul0);
        Vector128<int> mul1 = Vector128.Create(1000000, 10000, 100, 1);
        res = Sse41.MultiplyLow(res, mul1);
        Vector128<int> shuf = Sse2.Shuffle(res, 0x1b); // 0 1 2 3 => 3 2 1 0
        res = Sse2.Add(shuf, res);
        shuf = Sse2.Shuffle(res, 0x41); // 0 1 2 3 => 1 0 3 2
        res = Sse2.Add(shuf, res);
        return (uint)res.GetElement(0);
    }
}

~Twice faster than initial solution. (o_O) At least on my Haswell i7.

like image 84
aepot Avatar answered Oct 03 '22 10:10

aepot