How can I instruct the MSVC compiler to use a 64bit/32bit division instead of the slower 128bit/64bit division?

How can I tell the MSVC compiler to use the 64bit/32bit division operation to compute the result of the following function for the x86-64 target:

#include <stdint.h> 

uint32_t ScaledDiv(uint32_t a, uint32_t b) 
{
  if (a > b)
        return ((uint64_t)b<<32) / a;   // Yes, the cast is required because the result of b<<32 is undefined without it
  else
        return (uint32_t)-1;
}

I would like the code, when the if statement is true, to compile to use the 64bit/32bit division operation, e.g. something like this:

; Assume arguments on entry are: Dividend in EDX, Divisor in ECX
mov edx, edx  ;A dummy instruction to indicate that the dividend is already where it is supposed to be
xor eax,eax
div ecx   ; EAX = EDX:EAX / ECX

...however the x64 MSVC compiler insists on using the 128bit/64bit div instruction, such as:

mov     eax, edx
xor     edx, edx
shl     rax, 32                             ; Scale up the dividend
mov     ecx, ecx
div rcx   ;RAX = RDX:RAX / RCX

See: https://www.godbolt.org/z/VBK4R71

According to the answer to this question, the 128bit/64bit div instruction is not faster than the 64bit/32bit div instruction.

This is a problem because it unnecessarily slows down my DSP algorithm which makes millions of these scaled divisions.

I tested this optimization by patching the executable to use the 64bit/32bit div instruction: The performance increased 28% according to the two timestamps yielded by the rdtsc instructions.

(Editor's note: presumably on some recent Intel CPU. AMD CPUs don't need this micro-optimization, as explained in the linked Q&A.)

asked Dec 22 '22 by George Robinson

2 Answers

No current compilers (gcc/clang/ICC/MSVC) will do this optimization from portable ISO C source, even if you let them prove that b < a so the quotient will fit in 32 bits. (For example with GNU C if(b>=a) __builtin_unreachable(); on Godbolt). This is a missed optimization; until that's fixed, you have to work around it with intrinsics or inline asm.
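For reference, this is the kind of portable source plus hint being described (a sketch; the hint only proves the value range to the compiler, it doesn't change which div instruction current compilers pick):

#include <stdint.h>

// Portable ISO C, plus a GNU C hint that the quotient fits in 32 bits.
// gcc/clang/ICC still emit a full-width div for this today.
uint32_t ScaledDiv_hinted(uint32_t a, uint32_t b)
{
#ifdef __GNUC__
    if (b >= a)
        __builtin_unreachable();   // promise a > b, so (b<<32)/a fits in 32 bits
#endif
    return ((uint64_t)b << 32) / a;
}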

(Or use a GPU or SIMD instead; if you have the same divisor for many elements see https://libdivide.com/ for SIMD to compute a multiplicative inverse once and apply it repeatedly.)


_udiv64 is available starting in Visual Studio 2019 RTM.

In C mode (-TC) it's apparently always defined. In C++ mode, you need to #include <immintrin.h> (or intrin.h), as per the Microsoft docs.

https://godbolt.org/z/vVZ25L (Or on Godbolt.ms, because recent MSVC on the main Godbolt site is not working; see footnote 1.)

#include <stdint.h>
#include <immintrin.h>       // defines the prototype

// pre-condition: a > b else 64/32-bit division overflows
uint32_t ScaledDiv(uint32_t a, uint32_t b) 
{
    uint32_t remainder;
    uint64_t d = ((uint64_t) b) << 32;
    return _udiv64(d, a, &remainder);
}

int main() {
    uint32_t c = ScaledDiv(5, 4);
    return c;
}

_udiv64 will produce a 64/32-bit div. The two shifts (left and then right again) are a missed optimization.

;; MSVC 19.20 -O2 -TC
a$ = 8
b$ = 16
ScaledDiv PROC                                      ; COMDAT
        mov     edx, edx
        shl     rdx, 32                             ; 00000020H
        mov     rax, rdx
        shr     rdx, 32                             ; 00000020H
        div     ecx
        ret     0
ScaledDiv ENDP

main    PROC                                            ; COMDAT
        xor     eax, eax
        mov     edx, 4
        mov     ecx, 5
        div     ecx
        ret     0
main    ENDP

So we can see that MSVC doesn't do constant-propagation through _udiv64, even though in this case it doesn't overflow and it could have compiled main to just mov eax, 0ccccccccH / ret.


UPDATE #2: https://godbolt.org/z/n3Dyp- added a solution with the Intel C++ Compiler, but it is less efficient and defeats constant-propagation because it's inline asm.

#include <stdio.h>
#include <stdint.h>

__declspec(regcall, naked) uint32_t ScaledDiv(uint32_t a, uint32_t b) 
{
    __asm mov edx, eax
    __asm xor eax, eax
    __asm div ecx
    __asm ret
    // implicit return of EAX is supported by MSVC, and hopefully ICC
    // even when inlining + optimizing
}

int main()
{
    uint32_t a = 3 , b = 4, c = ScaledDiv(a, b);
    printf( "(%u << 32) / %u = %u\n", a, b, c);
    uint32_t d = ((uint64_t)a << 32) / b;
    printf( "(%u << 32) / %u = %u\n", a, b, d);
    return c != d;
}

Footnote 1: Matt Godbolt's main site's non-WINE MSVC compilers are temporarily(?) gone. Microsoft runs https://www.godbolt.ms/ to host the recent MSVC compilers on real Windows, and normally the main Godbolt.org site relayed to that for MSVC.

It seems godbolt.ms will generate short links, but not expand them again! Full links are better anyway for their resistance to link-rot.

answered Jan 29 '23 by Alex Lopatin

@Alex Lopatin's answer shows how to use _udiv64 to get non-terrible scalar code (despite MSVC's stupid missed optimization shifting left/right).

For compilers that support GNU C inline asm (including ICC), you can use that instead of the inefficient MSVC inline asm syntax that has a lot of overhead for wrapping a single instruction. See What is the difference between 'asm', '__asm' and '__asm__'? for an example wrapping 64-bit / 32-bit => 32-bit idiv. (Use it for div by just changing the mnemonic and the types to unsigned.) GNU C doesn't have an intrinsic for 64 / 32 or 128 / 64 division; it's supposed to optimize pure C. But unfortunately GCC / Clang / ICC have missed optimizations for this case even using if(a<=b) __builtin_unreachable(); to promise that a>b.
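For example, here's a minimal sketch of such a GNU C wrapper for unsigned 64/32-bit => 32-bit division (the helper name is made up; the caller must guarantee the high half of the dividend is below the divisor, otherwise div faults):

#include <stdint.h>

// EDX:EAX / divisor -> EAX (quotient), EDX (remainder)
static inline uint32_t udiv64_32(uint32_t hi, uint32_t lo, uint32_t divisor)
{
    uint32_t quotient, remainder;
    __asm__("divl %[v]"
            : "=a"(quotient), "=d"(remainder)
            : "a"(lo), "d"(hi), [v] "rm"(divisor));
    return quotient;
}

uint32_t ScaledDiv(uint32_t a, uint32_t b)
{
    // b:0 is (uint64_t)b<<32 expressed as a hi:lo pair; a > b keeps the quotient in 32 bits
    return (a > b) ? udiv64_32(b, 0, a) : (uint32_t)-1;
}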


But that's still scalar division, with pretty poor throughput.

Perhaps you can use a GPU for your DSP task? If you have a large enough batch of work (and the rest of your algorithm is GPU-friendly) then it's probably worth the overhead of the communication round trip to the GPU.

If you're using the CPU, then anything we can suggest will benefit from parallelizing over multiple cores, so do that for more throughput.


x86 SIMD (SSE4/AVX2/AVX512*) doesn't have SIMD integer division in hardware. The Intel SVML functions _mm_div_epu64 and _mm256_div_epu64 are not intrinsics for a real instruction; they're slow functions that maybe unpack to scalar or compute multiplicative inverses. Or whatever other trick they use; possibly the 32-bit division functions convert to SIMD vectors of double, especially if AVX512 is available. (Intel still calls them "intrinsics", maybe because they're like built-in functions that the compiler understands and can do constant-propagation through. They're probably as efficient as they can be, but that's "not very", and they need to handle the general case, not just your special case with the low half of one divisor being all zero and the quotient fitting in 32 bits.)

If you have the same divisor for many elements, see https://libdivide.com/ for SIMD to compute a multiplicative inverse once and apply it repeatedly. (You should adapt that technique to bake in the shifting of the dividend without actually doing it, leaving the all-zero low half implicit.)
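As an illustration, a sketch using libdivide's scalar C API (libdivide_u64_gen / libdivide_u64_do; check libdivide.h for the exact names in your version, and the library also offers vector variants). It assumes one divisor a reused over a whole batch, and still does the <<32 explicitly rather than baking it into the inverse as suggested above:

#include <stdint.h>
#include <stddef.h>
#include "libdivide.h"

void ScaledDivBatch(uint32_t *out, const uint32_t *b, size_t n, uint32_t a)
{
    // Compute the multiplicative inverse of 'a' once...
    struct libdivide_u64_t fast_a = libdivide_u64_gen(a);
    for (size_t i = 0; i < n; i++)
        // ...then each division becomes a multiply + shift instead of a div.
        out[i] = (uint32_t)libdivide_u64_do((uint64_t)b[i] << 32, &fast_a);
}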

If your divisor is always varying, and this isn't a middle step in some larger SIMD-friendly algorithm, scalar division may well be your best bet if you need exact results.


You could get big speedups from using SIMD float, if 24-bit mantissa precision is sufficient:

uint32_t ScaledDiv(uint32_t a, uint32_t b) 
{
    return ((1ULL<<32) * (float)b) / a;
}

(float)(1ULL<<32) is a compile-time constant 4294967296.0f.

This does auto-vectorize over an array, with gcc and clang even without -ffast-math (but not MSVC). See it on Godbolt. You could port gcc or clang's asm back to intrinsics for MSVC; they use some FP tricks for packed-conversion of unsigned integers to/from float without AVX512. Non-vectorized scalar FP will probably be slower than plain integer on MSVC, as well as less accurate.
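The array loop that gcc/clang auto-vectorize looks something like this (a sketch; names are illustrative):

#include <stdint.h>
#include <stddef.h>

void ScaledDivArray(uint32_t *restrict dst, const uint32_t *restrict b,
                    const uint32_t *restrict a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        // 4294967296.0f is (float)(1ULL<<32); one float divide per element
        dst[i] = (uint32_t)(4294967296.0f * (float)b[i] / (float)a[i]);
}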

For example, Skylake's div r32 throughput is 1 per 6 cycles. But its AVX vdivps ymm throughput is one instruction (of 8 floats) per 5 cycles. Or for 128-bit SSE2, divps xmm has one per 3 cycle throughput. So you get about 10x the division throughput from AVX on Skylake. (8 * 6/5 = 9.6) Older microarchitectures have much slower SIMD FP division, but also somewhat slower integer division. In general the ratio is smaller because older CPUs don't have as wide SIMD dividers, so 256-bit vdivps has to run the 128-bit halves through separately. But there's still plenty of gain to be had, like better than a factor of 4 on Haswell. And Ryzen has vdivps ymm throughput of 6c, but div r32 throughput of 14-30 cycles. So that's an even bigger speedup than Skylake.

If the rest of your DSP task can benefit from SIMD, the overall speedup should be very good. float operations have higher latency, so out-of-order execution has to work harder to hide that latency and overlap execution of independent loop iterations. So IDK whether it would be better for you to just convert to float and back for this one operation, or to change your algorithm to work with float everywhere. It depends what else you need to do with your numbers.


If your unsigned numbers actually fit into signed 32-bit integers, you can use direct hardware support for packed SIMD int32 -> float conversion. Otherwise you need AVX512F for packed uint32 -> float with a single instruction, but that can be emulated with some loss of efficiency. That's what gcc/clang do when auto-vectorizing with AVX2, and why MSVC doesn't auto-vectorize.

MSVC does auto-vectorize with int32_t instead of uint32_t (and gcc/clang can make more efficient code), so prefer that if the highest bit of your integer inputs and/or outputs can't be set. (i.e. the 2's complement interpretation of their bit-patterns will be non-negative.)
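If that applies to your data, the only change is the element type (a sketch with the same caveats as the loop above; only valid when the inputs and quotient stay within int32_t range):

#include <stdint.h>
#include <stddef.h>

void ScaledDivArrayS32(int32_t *restrict dst, const int32_t *restrict b,
                       const int32_t *restrict a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        // signed int32 -> float conversion has direct hardware support (cvtdq2ps)
        dst[i] = (int32_t)(4294967296.0f * (float)b[i] / (float)a[i]);
}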

With AVX especially, vdivps is slow enough to mostly hide the throughput costs of converting from integer and back, unless there's other useful work that could have overlapped instead.


Floating point precision:

A float stores numbers as significand * 2^exp where the significand is in the range [1.0, 2.0) (or [0, 1.0) for subnormals). A single-precision float has 24 bits of significand precision, including the 1 implicit bit.

https://en.wikipedia.org/wiki/Single-precision_floating-point_format

So the 24 most-significant bits of an integer can be represented; the rest are lost to rounding error. An integer like (uint64_t)b << 32 is no problem for float; that just means a larger exponent. The low bits are all zero.

For example, b = 123105810 gives us 528735427897589760 for b64 << 32. Converting that to float directly from 64-bit integer gives us 528735419307655168, a rounding error of 0.0000016%, or about 2^-25.8. That's unsurprising: the max rounding error is 0.5ulp (units in the last place), or 2^-25, and this number was even so it had 1 trailing zero anyway. That's the same relative error we'd get from converting 123105810; the resulting float is also the same except for its exponent field (which is higher by 32).

(I used https://www.h-schmidt.net/FloatConverter/IEEE754.html to check this.)
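A quick way to reproduce those numbers in code (purely illustrative):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t exact = (uint64_t)123105810 << 32;   // 528735427897589760
    float rounded = (float)exact;                 // nearest float, per the text: 528735419307655168
    double rel_err = ((double)rounded - (double)exact) / (double)exact;
    printf("exact = %llu\nfloat = %.0f\nrelative error = %g\n",
           (unsigned long long)exact, (double)rounded, rel_err);
    return 0;
}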

float's max exponent is large enough to hold integers outside the INT64_MIN to INT64_MAX range. The low bits of the large integers that float can represent are all zero, but that's exactly what you have with b<<32. So you're only losing the low 9 bits of b in the worst case where it's full-range and odd.

If the important part of your result is the most-significant bits, and having the low ~9 integer bits = rounding error is ok after converting back to integer, then float is perfect for you.

If float doesn't work, double may be an option.

divpd is about twice as slow as divps on many CPUs, and only does half as much work (2 double elements instead of 4 float). So you lose a factor of 4 throughput this way.

But every 32-bit integer can be represented exactly as a double. And by converting back with truncation towards zero, I think you get exact integer division for all pairs of inputs, unless double-rounding is a problem (first to nearest double, then truncation). You can test it with

// exactly correct for most inputs at least, maybe all.
uint32_t quotient = ((1ULL<<32) * (double)b) / a;

The unsigned long long constant (1ULL<<32) is converted to double, so you have 2x u32 -> double conversions (of a and b), a double multiply, a double divide, and a double -> u32 conversion. x86-64 can do all of these efficiently with scalar conversions (by zero extending uint32_t into int64_t, or ignoring the high bits of a double->int64_t conversion), but it will probably still be slower than div r32.

Converting u32 -> double and back (without AVX512) is maybe even more expensive than converting u32 -> float, but clang does auto-vectorize it. (Just change float to double in the godbolt link above). Again it would help a lot if your inputs were all <= INT32_MAX so they could be treated as signed integers for FP conversion.

If double-rounding is a problem, you could maybe set the FP rounding mode to truncation instead of the default round-to-nearest, if you don't use FP for anything else in the thread where your DSP code is running.
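For example, a sketch using the standard <fenv.h> interface (only safe if nothing else in this thread relies on round-to-nearest; compilers vary in how they spell and honor the FENV_ACCESS pragma, e.g. MSVC uses #pragma fenv_access (on)):

#include <fenv.h>
#pragma STDC FENV_ACCESS ON   // tell the compiler we change the FP environment

void run_dsp_with_truncation(void)
{
    int old_mode = fegetround();
    fesetround(FE_TOWARDZERO);   // FP multiplies/divides now round toward zero
    // ... run the double-based ScaledDiv loop here ...
    fesetround(old_mode);        // restore for the rest of the program
}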

answered Jan 29 '23 by Peter Cordes