
_addcarry_u64 and _addcarryx_u64 with MSVC and ICC

MSVC and ICC both support the intrinsics _addcarry_u64 and _addcarryx_u64.

According to Intel's Intrinsic Guide and white paper these should map to adcx and adox respectively. However, by looking at the generated assembly it's clear they map to adc and adcx respectively and there is no intrinsic which maps to adox.

Additionally, telling the compiler to enable AVX2 with /arch:AVX2 in MSVC or -march=core-avx2 with ICC on Linux makes no difference. I'm not sure how to enable ADX with MSVC and ICC.

The documentation for MSVC lists _addcarryx_u64 with the technology of ADX whereas _addcarry_u64 has no listed technology. However, the link in MSVC's documentation for these intrinsics goes directly to the Intel Intrinsic guide which contradicts MSVC's own documentation and the generated assembly.

From this I conclude that Intel's Intrinsic guide and white paper are wrong.

This makes some sense for MSVC: since it does not allow inline assembly in 64-bit mode, it should provide a way to use adc, which it does with _addcarry_u64.
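For concreteness, here is a minimal sketch of a single carry chain built on _addcarry_u64. The function name add_limbs, the ADDC64 macro, and the non-x86-64 fallback are illustrative assumptions, not part of any compiler's API; only _addcarry_u64 itself comes from the intrinsics discussed here. Which instruction the compiler emits for the loop should be confirmed by reading the generated assembly.

```c
#include <stddef.h>

#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>
#define ADDC64(c, x, y, out) _addcarry_u64((c), (x), (y), (out))
#else
/* Hypothetical portable stand-in so the sketch builds off x86-64.
   Assumes the incoming carry c is 0 or 1. */
static unsigned char addc64_fallback(unsigned char c, unsigned long long x,
                                     unsigned long long y,
                                     unsigned long long *out)
{
    unsigned long long s = x + y;
    unsigned char c1 = (unsigned char)(s < x);   /* carry out of x + y */
    unsigned long long t = s + c;
    unsigned char c2 = (unsigned char)(t < s);   /* carry out of adding c */
    *out = t;
    return (unsigned char)(c1 | c2);
}
#define ADDC64(c, x, y, out) addc64_fallback((c), (x), (y), (out))
#endif

/* dst = a + b over n 64-bit limbs (least significant limb first).
   Returns the carry out of the most significant limb. */
unsigned char add_limbs(unsigned long long *dst,
                        const unsigned long long *a,
                        const unsigned long long *b, size_t n)
{
    unsigned char carry = 0;
    for (size_t i = 0; i < n; i++)
        carry = ADDC64(carry, a[i], b[i], &dst[i]);
    return carry;
}
```

Calling add_limbs(dst, a, b, 2) with a = {~0ULL, ~0ULL} and b = {1, 0} propagates the carry through both limbs and returns a carry-out of 1.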

One of the big advantages of adcx and adox is that they operate on different flags (carry CF and overflow OF) and this allows two independent parallel carry chains. However, since there is no intrinsic for adox how is this possible? With ICC at least one can use inline assembly but this is not possible with MSVC in 64-bit mode.


Update: Microsoft's and Intel's documentation (both the white paper and the online intrinsics guide) now agree.

The documentation for _addcarry_u64 says it produces only adc. The _addcarryx_u64 intrinsic can produce either adcx or adox. With MSVC 2013 and 2015, however, _addcarryx_u64 only produces adcx. ICC produces both.

Z boson asked Mar 24 '15 09:03



2 Answers

They map to adc, adcx AND adox. The compiler decides which instructions to use, based on how you use them. If you perform two big-int additions in parallel the compiler will use adcx and adox, for higher throughput. For example:

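unsigned char c1 = 0, c2 = 0;   /* one carry flag per chain */
for (int i = 0; i < 100; i++) {
    /* chain 1: res += a */
    c1 = _addcarry_u64(c1, res[i], a[i], &res[i]);
    /* chain 2: res += b, with its own independent carry */
    c2 = _addcarry_u64(c2, res[i], b[i], &res[i]);
}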
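A self-contained version of the loop above might look like the following sketch. The function name add_two_chains, the ADDC64 macro, and the fallback for non-x86-64 targets are assumptions made for illustration. Whether a particular compiler turns the two chains into adcx and adox depends on the compiler and its version, so check the generated assembly rather than relying on it.

```c
#include <stddef.h>

#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>
#define ADDC64(c, x, y, out) _addcarry_u64((c), (x), (y), (out))
#else
/* Hypothetical portable stand-in so the sketch builds off x86-64.
   Assumes the incoming carry c is 0 or 1. */
static unsigned char addc64_fallback(unsigned char c, unsigned long long x,
                                     unsigned long long y,
                                     unsigned long long *out)
{
    unsigned long long s = x + y;
    unsigned char c1 = (unsigned char)(s < x);
    unsigned long long t = s + c;
    unsigned char c2 = (unsigned char)(t < s);
    *out = t;
    return (unsigned char)(c1 | c2);
}
#define ADDC64(c, x, y, out) addc64_fallback((c), (x), (y), (out))
#endif

/* res += a + b over n 64-bit limbs (least significant limb first),
   using two carry chains that never feed into each other.
   Returns the total carry out of the top limb (0, 1 or 2). */
unsigned add_two_chains(unsigned long long *res,
                        const unsigned long long *a,
                        const unsigned long long *b, size_t n)
{
    unsigned char c1 = 0, c2 = 0;
    for (size_t i = 0; i < n; i++) {
        c1 = ADDC64(c1, res[i], a[i], &res[i]);  /* chain 1: res += a */
        c2 = ADDC64(c2, res[i], b[i], &res[i]);  /* chain 2: res += b */
    }
    return (unsigned)c1 + (unsigned)c2;
}
```

Each chain's carry only flows into the next limb of the same chain, which is what makes the two chains candidates for separate flags (CF and OF).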
Vlad Krasnov answered Oct 15 '22 14:10



Related, GCC does not support ADOX and ADCX at the moment. "At the moment" includes GCC 6.4 (Fedora 25) and GCC 7.1 (Fedora 26). GCC effectively disabled the intrinsics, but it still advertises support by defining __ADX__ in the preprocessor. Also see Issue 67317, Silly code generation for _addcarry_u32/_addcarry_u64. Many thanks to Xi Ruoyao for finding the issue.

According to Uros Bizjak on the GCC Help mailing list, GCC may never support the intrinsics. Also see GCC does not generate ADCX or ADOX for _addcarryx_u64.

Clang has its own set of issues with respect to ADOX and ADCX. Clang 3.9 and 4.0 crash when attempting to use them. Also see Issue 34249, Panic when using _addcarryx_u64 with Clang 3.9. According to Craig Topper, it should be fixed in Clang 5.0.

My apologies for posting the information under an MSVC question. This is one of the few hits when searching for information about using the intrinsics.

jww answered Oct 15 '22 14:10
