 

Which integer operations have higher performance alternate methods in Rust?

When writing integer functions in Rust that will run millions of times (think pixel processing), it's useful to use the operations with the highest performance - similar to C/C++.

While the reference manual explains differences in behavior, it's not always clear which methods offer higher performance than the standard (see note 1) integer arithmetic operators. I'd assume wrapping_add compiles down to something equivalent to C's addition.

Of the standard operations (add / subtract / multiply / modulo / divide / shift / bit manipulation...), which operations have higher performance alternatives which aren't used by default?


Note:

  1. By standard I mean integer arithmetic using symbols a + b, i / k or c % e ... etc.
    What you would use when writing math expressions - unless you have a special need for using one of the methods that wraps or returns the overflow.
  2. I realize answering this question may require some research. So I'm happy to do some checks by looking at resulting assembly to see which operations are using unchecked/primitive operations.
  3. It may be that the speed difference between checked / unchecked operations isn't significant, if that's the case I'd still like to be able to write a 'fast' version of a function to compare against the 'safe' version, to come to my own conclusion as to whether it's a reasonable choice for a given function.
  4. Having mentioned pixel-processing, SIMD has come up as a possible solution. While this is a good suggestion, it still leaves the cases which can't be optimized using SIMD, so the general case of fast integer arithmetic is still worth considering.
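As a starting point for note 3, here is a minimal sketch of what a 'safe' versus 'fast' pair of the same function might look like (the brighten_* functions and their clamping behavior are hypothetical examples, chosen so the two variants can be compared in the generated assembly or in a benchmark):

```rust
// Hypothetical pixel-brightening function in two variants: one using
// checked arithmetic (clamping on overflow), one using wrapping arithmetic
// (which compiles to a plain add with no overflow branch).
fn brighten_safe(pixels: &mut [u8], amount: u8) {
    for p in pixels.iter_mut() {
        // checked_add returns None on overflow; clamp to 255 in that case.
        *p = p.checked_add(amount).unwrap_or(u8::MAX);
    }
}

fn brighten_fast(pixels: &mut [u8], amount: u8) {
    for p in pixels.iter_mut() {
        // wrapping_add never branches: overflow wraps around.
        *p = p.wrapping_add(amount);
    }
}

fn main() {
    let mut a = [100u8, 200, 250];
    let mut b = a;
    brighten_safe(&mut a, 10);
    brighten_fast(&mut b, 10);
    println!("{:?} {:?}", a, b); // [110, 210, 255] [110, 210, 4]
}
```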
asked Dec 12 '16 by ideasman42

3 Answers

Of the standard operations (add / subtract / multiply / modulo / divide / shift / bit manipulation...), which operations have higher performance alternatives which aren't used by default?

Note that Rust was designed for performance; as a result, while integer operations are checked in Debug, they are defined to wrap in Release unless you specifically instruct the compiler otherwise.

As a result, in release mode with default options, there is strictly no performance difference between:

  • + and wrapping_add
  • - and wrapping_sub
  • * and wrapping_mul
  • / and wrapping_div
  • % and wrapping_rem
  • << and wrapping_shl
  • >> and wrapping_shr

For unsigned integers, the performance is thus essentially that of C or C++ (with the caveat that division and remainder always include a check for a zero divisor, in both forms); for signed integers, however, the optimizer might yield different results, since underflow/overflow on signed integers is undefined behavior in C and C++ (gcc and Clang accept a -fwrapv flag to mandate wrapping even for signed integers, but it's not the default).
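While the operator and the wrapping method produce the same machine code in release mode, what the wrapping variants additionally guarantee (in every build mode) is two's-complement wraparound:

```rust
fn main() {
    // The wrapping methods make two's-complement wraparound explicit,
    // which is what the plain operators do in release builds by default.
    assert_eq!(255u8.wrapping_add(1), 0);
    assert_eq!(0u8.wrapping_sub(1), 255);
    assert_eq!(i32::MAX.wrapping_add(1), i32::MIN);
    // Shifts mask the shift amount by the bit width: 33 % 32 == 1 here.
    assert_eq!(1u32.wrapping_shl(33), 2);
}
```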

I expect that the checked_*, overflowing_* and saturating_* methods will, however, be slower in general.
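For reference, the three families differ in what they return, which is where their extra cost comes from:

```rust
fn main() {
    // checked_* returns an Option: None on overflow (a branch per operation).
    assert_eq!(200u8.checked_add(100), None);
    assert_eq!(100u8.checked_add(100), Some(200));
    // overflowing_* returns the wrapped value plus an overflow flag.
    assert_eq!(200u8.overflowing_add(100), (44, true));
    // saturating_* clamps at the numeric bounds.
    assert_eq!(200u8.saturating_add(100), 255);
}
```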


An interesting tangent, then, is to understand what happens when you flip the switch and explicitly require checked arithmetic.

Currently, the Rust implementation1 is a precise implementation of underflow/overflow checking. Each addition, subtraction, multiplication, ... is checked independently, and the optimizer is not good at fusing those branches.

Specifically, a precise implementation precludes temporary overflows: 5 + x - 5 cannot be optimized as x, because 5 + x could overflow. It also precludes auto-vectorization in general.

Only when the optimizer can prove the absence of overflow (which it generally cannot) may you hope to regain a branch-free path that is more amenable to optimizations.
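The temporary-overflow point can be seen directly: with wrapping arithmetic the intermediate overflow cancels out, so 5 + x - 5 can legally be reduced to x, while with checked arithmetic the intermediate 5 + x already fails:

```rust
fn main() {
    let x = i32::MAX;
    // Wrapping: the overflow in the intermediate sum cancels out, so the
    // optimizer is free to rewrite the whole expression as just `x`.
    assert_eq!(x.wrapping_add(5).wrapping_sub(5), x);
    // Checked: the intermediate `x + 5` overflows, so the expression
    // cannot be rewritten as `x` -- it must report the failure.
    assert_eq!(x.checked_add(5), None);
}
```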

One should note that in general software the impact is barely noticeable, as arithmetic instructions represent a small portion of the overall cost. When this proportion rises, however, it can be very noticeable, and indeed it shows up in parts of the SPEC2006 benchmark with Clang.

This overhead was sufficient to be deemed unsuitable for the checks to be activated by default.

1This is due to technical limitations on the LLVM side; the Rust implementation simply delegates to LLVM.


In the future, there is hope that a fuzzy implementation of the checks will become available. The idea behind a fuzzy implementation is that instead of checking each and every operation, the operations are simply executed and a flag is set (or the values are poisoned) on underflow/overflow. Then, before the result is used, a check (branch) is executed.

According to Joe Duffy, they had such an implementation in Midori and the performance impact was barely noticeable, so it seems to be feasible. I am not aware of any effort to have anything similar in LLVM yet, though.

answered Nov 16 '22 by Matthieu M.


Rust gives no guarantees as to the speed of its operations. If you want guarantees, you need to call into assembler.

That said, Rust currently forwards to LLVM, so you can call the intrinsics, which map 1:1 to LLVM intrinsics, and rely on LLVM's guarantees. Still, whatever you do that's not asm, be aware that the optimizer might have a different opinion of what you consider optimal, and thus de-optimize your manual calls to LLVM intrinsics.

Still, Rust strives to be as fast as possible, so you can assume (or simply look at the standard library's implementation) that every operation with a matching LLVM intrinsic will map to that intrinsic, and thus be as fast as LLVM can make it.

There is no general rule as to which operation is the fastest for a given basic arithmetic operation, since it totally depends on your use case.

answered Nov 16 '22 by oli_obk


think pixel processing

Then you shouldn't be thinking single-valued operations at all; you want to use SIMD instructions instead. These are currently not available in stable Rust, but some are accessible through feature-gated functions and all are available through assembly.

Is it possible LLVM optimizes code into SIMD, like it does for clang?

As aochagavia already replied, yes, LLVM will autovectorize certain types of code. However, when you demand the highest performance, you don't usually want to leave yourself at the whims of the optimizer. I tend to hope for autovectorization in my normal run-of-the-mill code, then write the straight-line code for my heavy-math kernels, then write SIMD code and test for correctness and benchmark for speed.
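As a sketch of the kind of straight-line code LLVM has a good chance of autovectorizing (iterator-based, so there are no bounds checks in the loop body, and using a saturating add, which maps to a single SIMD instruction on common targets) - note that whether vectorization actually happens depends on the target and optimization level, so inspect the assembly to confirm:

```rust
// Element-wise saturating add over two pixel buffers.
fn add_buffers(dst: &mut [u8], src: &[u8]) {
    for (d, s) in dst.iter_mut().zip(src.iter()) {
        *d = d.saturating_add(*s);
    }
}

fn main() {
    let mut dst = [10u8, 250, 128, 7];
    let src = [5u8, 10, 100, 3];
    add_buffers(&mut dst, &src);
    // Overflowing elements clamp at 255 instead of wrapping.
    assert_eq!(dst, [15, 255, 228, 10]);
}
```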

answered Nov 16 '22 by Shepmaster