How to use processor instructions in C++ to implement fast arithmetic operations

Question

I was working on the C++ implementation of Shamir's secret sharing scheme. I split the message into 8-bit chunks and on each performs corresponding arithmetic. The underlying finite field was Rijndael's finite field F_256 / (x^8 + x^4 + x^3 + x + 1).

I made a quick search if there is some well-known and spread library for Rijndael's finite field calculations (e. g. OpenSSL or similar), and didn't find any. So I implemented it from scratch, partly as a programming exercise. A few days ago, however, a professor at our university mentioned following: "Modern processors support carry-less integer operations, so the characteristic-2 finite field multiplications run fast nowadays.".

Hence, since I know just little about hardware, assembler, and similar stuff, my question is: How do I actually use (in C++) all the modern processors' instructions when building crypto software - whether it is AES, SHA, arithmetic from above or whatever else? I can't find any satisfactory resources on that. My idea is to build a library containing both: "Modern-approach fast implementation" and fallback "pure C++ dependency-less code" and let the GNU Autoconf decide which one to use on each respective host. Any book/article/tutorial recommendation on this topic would be appreciated.

BeeOnRope · Accepted Answer

The question is quite broad because there are several ways you might access the power of the underlying hardware, so instead of one specific way here's a list of ways you can try to use all the modern processors' instructions:

Idiom Recognition

Write out the operation not offered directly in C++ in "long form" and hope your compiler recognizes it as an idiom for the underlying instruction you want. For example, you could write a variable rotate left of x by amount as (x << amount) | (x >> (32 - amount)) and all of gcc, clang and icc will recognize this as a rotate and issue the underlying rol instruction supported by x86.

Sometimes this technique puts you in a bit of an uncomfortable spot: the above C++ rotate implementation exhibits undefined behavior for amount == 0 (and also amount >= 32) since the result of a shift of 32 on a uint32_t is undefined, but the code actually produced by these compilers is just fine in that case. Still, having this lurking undefined behavior in your program is dangerous, and it probably won't run clear against ubsan and friends. The alternative safe version amount ? (x << amount) | (x >> (32 - amount)) : x; is only recognized by icc, but not by gcc or clang.

This approach tends to work for common idioms that map directly to assembly-level instructions that have been around for a while: rotates, bit tests and sets, multiplications with a wider result than inputs (e.g., multiplying two 32-bit values for a 64-bit result), conditional moves and so on, but is less likely to pick up bleeding edge instructions that might also be of interest to cryptography. For example, I'm quite sure no compiler will currently recognize an application of the AES instruction set extensions. It also works best on platforms that have received a lot of effort on the part of the compiler developers since each recognized idiom has to be added by hand.

I don't think this technique will work with your carry-less multiplication (PCLMULQDQ), but maybe one day (if you file an issue against the compilers)? It does work for other "crypt-interesting" functions though, including rotate.

Intrinsic Functions

As an extension compilers will often offer intrinsic functions which are not part of the language proper, but often map directly to an instruction offered by most hardware. Although it looks like a function call, the compiler generally just emits the single instruction needed at the place you call it.

GCC calls these built-in functions and you can find a list of generic ones here. For example, you can use the __builtin_popcnt call to emit the popcnt instruction, if the current target supports it. Man of the gcc builtins are also supported by icc and clang, and in this case all of gcc, clang and icc support this call and emit popcnt as long as the architecture (-march=Haswell)is set to Haswell. Otherwise, clang and icc inline a replacement version using some clever SWAR tricks, while gcc calls __popcountdi2 which is provided by the runtime¹.

The list of intrinsics above are generic and generally offered on any platform the compilers support. You can also find platform specific instrinics, for example this list from gcc.

For x86 SIMD instructions specifically, Intel makes available a set of intrinsic functions declared headers covering their ISA extensions, e.g., by including #include <x86intrin.h>. These have wider support than the gcc instrinsics, e.g., they are supported by Microsoft's Visual Studio compiler suite. New instruction sets are usually added before chips that support them become available, so you can use these to access new instructions immediate on release.

Programming with SIMD intrinsic functions is kind of a halfway house between C++ and full assembly. The compiler still takes care of things like calling conventions and register allocation, and some optimization are made (especially for generating constants and other broadcasts) - but generally what you write is more or less what you get at the assembly level.

Inline Assembly

If your compiler offers it, you can use inline assembly to call whatever instructions you want². This has a lot of similarities to using intrinsic functions, but with a somewhat higher level of difficulty and less opportunities for the optimizer to help you out. You should probably prefer intrinsic functions unless you have a specific reason for inline assembly. One example could be if the optimizer does a really bad job with intrinsics: you could use an inline assembly block to get exactly the code you want.

Out-of-line Assembly

You can also just write your entire kernel function in assembly, assembly it how you want, and then declare it extern "C" and call it from C++. This is similar to the inline assembly option, but works on compilers that don't support inline assembly (e.g., 64-bit Visual Studio). You can also use a different assembler if you want, which is especially convenient if you are targeting multiple C++ compilers since you can then use a single assembler for all of them.

You need to take care of the calling conventions youself, and other messy things like DWARF unwind info and Windows SEH handling.

For very short functions, this approach doesn't work well since the call overhead will likely be prohibitive³.

Auto-Vectorization⁴

If you want to write fast cryptography today for a CPU, you are pretty much going to be targeting mostly SIMD instructions. Most new algorithms designed with software implementation are also designed with vectorization in mind.

You can intrinsic functions or assembly to write SIMD code, but you can also write normal scalar code and rely on the auto-vectorizer. These got a bad name back in the early days of SIMD, and while they are still far from perfect they have come a long way.

Consider this simple function with takes payload and key byte array and xors key into payload:

void otp_scramble(uint8_t* payload, uint8_t* key, size_t n) {
    for (size_t i = 0; i < n; i++) {
        payload[i] ^= key[i];
    }
}

This is a softball example, of course, but anyways gcc, clang and icc all vectorize this to something like this inner loop⁴:

  movdqu xmm0, XMMWORD PTR [rdi+rax]
  movdqu xmm1, XMMWORD PTR [rsi+rax]
  pxor xmm0, xmm1
  movups XMMWORD PTR [rdi+rax], xmm0

It's using SSE instructions to load and xor 16 bytes at a time. The developer only has to reason about the simple scalar code, however!

One advantage of this approach versus intrinsics or assembly is that you aren't baking in the SIMD length of the instruction set at the source level. The same C++ code as above compiled with -march=haswell results in a loop like:

  vmovdqu ymm1, YMMWORD PTR [rdi+rax]
  vpxor ymm0, ymm1, YMMWORD PTR [rsi+rax]
  vmovdqu YMMWORD PTR [rdi+rax], ymm0

It's using the AVX2 instructions available on Haswell to do 32-bytes at a time. If you compile with -march=skylake-avx512 clang uses 64-byte vxorps instructions on zmm registers (but gcc and icc stick with 32-byte inner loops). So in principle you can take some advantage of new ISA simply with a recompile.

A downside of auto-vectorizatoin is that it is fairly fragile. What auto-vectorizes on one compiler might not on another or even on another version of the same compiler. So you need to check you are getting the results you want. The auto-vectorizer is often working with less information than you have: it might not know that the input length is a multiple of some power or two or that the input pointers are aligned in a certain way. Sometimes you can communicate this information to the compiler, but sometimes you can't.

Sometimes the compiler makes "interesting" decisions when it vectorizes, such as a small not-unrolled body for the inner loop, but then a giant "intro" or "outro" handling odd iterations, like what gcc produces after the first loop shown above:

  movzx ecx, BYTE PTR [rsi+rax]
  xor BYTE PTR [rdi+rax], cl
  lea rcx, [rax+1]
  cmp rdx, rcx
  jbe .L1
  movzx r8d, BYTE PTR [rsi+1+rax]
  xor BYTE PTR [rdi+rcx], r8b
  lea rcx, [rax+2]
  cmp rdx, rcx
  jbe .L1
  movzx r8d, BYTE PTR [rsi+2+rax]
  xor BYTE PTR [rdi+rcx], r8b
  lea rcx, [rax+3]
  cmp rdx, rcx
  jbe .L1
  movzx r8d, BYTE PTR [rsi+3+rax]
  xor BYTE PTR [rdi+rcx], r8b
  lea rcx, [rax+4]
  cmp rdx, rcx
  jbe .L1
  movzx r8d, BYTE PTR [rsi+4+rax]
  xor BYTE PTR [rdi+rcx], r8b
  lea rcx, [rax+5]
  cmp rdx, rcx
  jbe .L1
  movzx r8d, BYTE PTR [rsi+5+rax]
  xor BYTE PTR [rdi+rcx], r8b
  lea rcx, [rax+6]
  cmp rdx, rcx
  jbe .L1
  movzx r8d, BYTE PTR [rsi+6+rax]
  xor BYTE PTR [rdi+rcx], r8b
  lea rcx, [rax+7]
  cmp rdx, rcx
  jbe .L1
  movzx r8d, BYTE PTR [rsi+7+rax]
  xor BYTE PTR [rdi+rcx], r8b
  lea rcx, [rax+8]
  cmp rdx, rcx
  jbe .L1
  movzx r8d, BYTE PTR [rsi+8+rax]
  xor BYTE PTR [rdi+rcx], r8b
  lea rcx, [rax+9]
  cmp rdx, rcx
  jbe .L1
  movzx r8d, BYTE PTR [rsi+9+rax]
  xor BYTE PTR [rdi+rcx], r8b
  lea rcx, [rax+10]
  cmp rdx, rcx
  jbe .L1
  movzx r8d, BYTE PTR [rsi+10+rax]
  xor BYTE PTR [rdi+rcx], r8b
  lea rcx, [rax+11]
  cmp rdx, rcx
  jbe .L1
  movzx r8d, BYTE PTR [rsi+11+rax]
  xor BYTE PTR [rdi+rcx], r8b
  lea rcx, [rax+12]
  cmp rdx, rcx
  jbe .L1
  movzx r8d, BYTE PTR [rsi+12+rax]
  xor BYTE PTR [rdi+rcx], r8b
  lea rcx, [rax+13]
  cmp rdx, rcx
  jbe .L1
  movzx r8d, BYTE PTR [rsi+13+rax]
  xor BYTE PTR [rdi+rcx], r8b
  lea rcx, [rax+14]
  cmp rdx, rcx
  jbe .L1
  movzx eax, BYTE PTR [rsi+14+rax]
  xor BYTE PTR [rdi+rcx], al

You probably have better things to spend your instruction cache on (and this is far from the worst I've seen: it's easy to get examples with several hundreds of instructions in the intro and outro parts).

Unfortunately, the vectorizer probably won't produce crypto-specific instructions like carry-less multiply. You could consider a mix of scalar code that gets vectorized and an intrinsic only for the instructions the compiler won't generate, but this is easier to suggest than actually do successfully. At that point you are probably better off writing your entire loop with intrinsics.

¹ The advantage of the gcc approach here is that at runtime if the platform supports popcnt this call can resolve to an implementation that just uses a popcnt instruction, using the GNU IFUNC mechanism.

² Assuming the underlying assembler supports it, but even if it doesn't you could just encode the raw instruction bytes in the inline assembly block.

³ The call overhead includes more than just the explicit costs of the call and ret and argument passing: it also includes the effect on the optimizer which can't optimize code as well in the caller around the function call since it has unknown side-effects.

⁴ In some ways, auto-vectorization could be seen as a special case of idiom recognition, but it is important enough and has enough unique considerations that it gets its own section here.

⁵ With minor differences: gcc is as shown, clang unrolled a bit, and icc used a load-op pxor rather than a separate load.

How to use processor instructions in C++ to implement fast arithmetic operations

Tags:

c++

assembly

processor

instruction-set

jakob

1 Answers

Idiom Recognition

Intrinsic Functions

Inline Assembly

Out-of-line Assembly

Auto-Vectorization⁴

BeeOnRope

Recent Activity

Donate For Us

How to use processor instructions in C++ to implement fast arithmetic operations

Tags:

c++

assembly

processor

instruction-set

jakob

1 Answers

Idiom Recognition

Intrinsic Functions

Inline Assembly

Out-of-line Assembly

Auto-Vectorization4

BeeOnRope

Related questions

Recent Activity

Donate For Us

Auto-Vectorization⁴