
Multiply 64-bit integers using .NET Core's hardware intrinsics

I'm writing some performance-sensitive code, where multiplication of unsigned 64-bit integers (ulong) is a bottleneck.

.NET Core 3.0 brings access to hardware intrinsics with the System.Runtime.Intrinsics namespace, which is fantastic.

I'm currently using a portable implementation that returns a tuple of the high and low bits of the 128-bit result:

[MethodImpl(MethodImplOptions.AggressiveInlining)]
internal static unsafe (ulong Hi, ulong Lo) Multiply64(ulong x, ulong y)
{
    ulong hi;
    ulong lo;

    lo = x * y;

    ulong x0 = (uint)x;
    ulong x1 = x >> 32;

    ulong y0 = (uint)y;
    ulong y1 = y >> 32;

    ulong p11 = x1 * y1;
    ulong p01 = x0 * y1;
    ulong p10 = x1 * y0;
    ulong p00 = x0 * y0;

    // 64-bit product + two 32-bit values
    ulong middle = p10 + (p00 >> 32) + (uint)p01;

    // 64-bit product + two 32-bit values
    hi = p11 + (middle >> 32) + (p01 >> 32);

    return (hi, lo);
}
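The same decomposition can be sanity-checked outside .NET; here it is as a small C function (a sketch assuming GCC/Clang, whose unsigned __int128 extension provides a reference 128-bit product to compare against):

```c
#include <stdint.h>

/* Portable 64x64 => 128-bit multiply: split each operand into 32-bit
   halves and combine the four partial products, mirroring the C# code. */
static void multiply64(uint64_t x, uint64_t y, uint64_t *hi, uint64_t *lo)
{
    *lo = x * y;  /* low 64 bits come straight from the truncating multiply */

    uint64_t x0 = (uint32_t)x, x1 = x >> 32;
    uint64_t y0 = (uint32_t)y, y1 = y >> 32;

    uint64_t p11 = x1 * y1, p01 = x0 * y1;
    uint64_t p10 = x1 * y0, p00 = x0 * y0;

    /* 64-bit product + two 32-bit values: cannot overflow 64 bits */
    uint64_t middle = p10 + (p00 >> 32) + (uint32_t)p01;

    *hi = p11 + (middle >> 32) + (p01 >> 32);
}
```

On GCC/Clang the result can be verified against `(unsigned __int128)x * y`.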

I want to make this faster using intrinsics. I'm clear on how to use BMI2 when available (this is ~50% faster than the portable version):

if (System.Runtime.Intrinsics.X86.Bmi2.X64.IsSupported)
{
    ulong lo;
    ulong hi = System.Runtime.Intrinsics.X86.Bmi2.X64.MultiplyNoFlags(x, y, &lo); // needs an unsafe context
    return (hi, lo);
}

I'm totally unclear on how to use the other intrinsics that are available; they all seem to rely on the Vector128<T> type, and none of them seem to deal with the ulong type.

How can I implement multiplication of ulongs using SSE, AVX etc?

asked May 07 '19 by Cocowalla

1 Answer

SIMD vectors aren't single wide integers. The max element width is 64-bit. They're for processing multiple elements in parallel.

x86 doesn't have any instructions for 64x64 => 128-bit SIMD-element multiply, not even with AVX512DQ. (That does provide SIMD 64x64 => 64-bit multiply though, for 2, 4, or 8 elements in parallel.)

AVX512IFMA (in Cascade Lake) has 52-bit high and low-half multiply-accumulate (not a coincidence that's the significand width of double; SIMD integer multiply instructions use the same multiply hardware as FP).


So if you wanted 64x64 => 128-bit SIMD multiply, you'd have to synthesize it out of 4x 32x32 => 64-bit vpmuludq and some adds, including an add-with-carry which you'd again have to synthesize from multiple instructions.

This would probably be slower than scalar mul r64 for an array of multiplies even with AVX512 available. It only takes 4 scalar mul instructions to produce 512 bits of multiply results, and modern x86 CPUs fully pipeline mul so they can produce 1 pair of results per clock. (Of course, store throughput is only 1 per clock until IceLake / Sunny Cove, so getting both halves of the 64-bit result stored is a problem! But moving data to XMM registers for 128-bit stores costs more uops and also hits a 64-bit-per-clock bottleneck.)

If you only needed 64x64 => 64-bit multiply, you could drop the high32*high32 multiply. I wrote a C++ version of that in Fastest way to multiply an array of int64_t? and it is barely faster than scalar on Haswell with AVX2, but significantly faster on Skylake. Either way, it would be nowhere near worth it without AVX2.
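The truncated version drops the high32*high32 product entirely, because it only affects bits above 63. A scalar C sketch of the per-element arithmetic that a vpmuludq-based vector loop would perform:

```c
#include <stdint.h>

/* 64x64 => 64-bit (low half only): x1*y1 shifts entirely past bit 63,
   so only three 32x32 => 64-bit products are needed. */
static uint64_t mul64_low(uint64_t x, uint64_t y)
{
    uint64_t x0 = (uint32_t)x, x1 = x >> 32;
    uint64_t y0 = (uint32_t)y, y1 = y >> 32;
    return x0 * y0 + ((x0 * y1 + x1 * y0) << 32);
}
```

A vector version does the same thing with vpmuludq (e.g. _mm256_mul_epu32 in C) plus shifts and adds, processing four 64-bit elements at a time.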


And BTW, you don't need BMI2 to do scalar 64x64 => 128-bit multiplies.

That's baseline for x86-64, with either one-operand mul (unsigned) or imul (signed). If C# exposes an intrinsic for BMI2 mulx, it surely must expose one for plain unsigned mul and signed imul which are at least as efficient in most cases (and smaller code size).
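To illustrate with C (a sketch assuming GCC/Clang's unsigned __int128 extension): a widening multiply written through the 128-bit type compiles to that single one-operand mul on x86-64, no BMI2 required.

```c
#include <stdint.h>

/* Full 64x64 => 128-bit product via the compiler's 128-bit type;
   GCC/Clang lower this to a single `mul r64` on x86-64. */
static void bigmul_u64(uint64_t x, uint64_t y, uint64_t *hi, uint64_t *lo)
{
    unsigned __int128 p = (unsigned __int128)x * y;
    *hi = (uint64_t)(p >> 64);
    *lo = (uint64_t)p;
}
```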

answered Oct 22 '22 by Peter Cordes