Branchless conditionals on integers — fast, but can they be made faster?

I've been experimenting with the following and have noticed that the branchless “if” defined here (now with &-!! replacing *!!) can speed up certain bottleneck code by as much as (almost) 2x on 64-bit Intel targets with clang:

// Produces x if f is true, else 0 if f is false.
#define  BRANCHLESS_IF(f,x)          ((x) & -((typeof(x))!!(f)))

// Produces x if f is true, else y if f is false.
#define  BRANCHLESS_IF_ELSE(f,x,y)  (((x) & -((typeof(x))!!(f))) | \
                                     ((y) & -((typeof(y)) !(f))))

Note that f should be a reasonably simple expression with no side-effects, so that the compiler is able to do its best optimizations.
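
For example (a hypothetical misuse, not taken from the code in this question): BRANCHLESS_IF_ELSE expands f twice, so any side effect in f happens twice.

// Hypothetical misuse: the macro expands (f) twice, so the increment runs
// twice -- and the two unsequenced increments are in fact undefined behaviour.
uint64_t r_bad = BRANCHLESS_IF_ELSE(i++ < limit, a, b);

// Safe: the condition is a simple expression with no side effects.
uint64_t r_ok  = BRANCHLESS_IF_ELSE(i < limit, a, b);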

Performance is highly dependent on CPU and compiler. The branchless ‘if’ performance is excellent with clang; I haven't found any cases yet where the branchless ‘if/else’ is faster, though.

My question is: are these safe and portable as written (meaning guaranteed to give correct results on all targets), and can they be made faster?

Example usage of branchless if/else

These compute 64-bit minimum and maximum.

inline uint64_t uint64_min(uint64_t a, uint64_t b)
{
  return BRANCHLESS_IF_ELSE((a <= b), a, b);
}

inline uint64_t uint64_max(uint64_t a, uint64_t b)
{
  return BRANCHLESS_IF_ELSE((a >= b), a, b);
}
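
To make the masking explicit, here is uint64_min expanded by hand (an illustration only; the local names are mine, not part of the macros above):

inline uint64_t uint64_min_expanded(uint64_t a, uint64_t b)
{
  // !!(a <= b) is exactly 0 or 1; negating it as uint64_t gives a mask of
  // all-zero or all-one bits.
  uint64_t keep_a = -(uint64_t)!!(a <= b);  // all ones if a <= b, else 0
  uint64_t keep_b = -(uint64_t) !(a <= b);  // all ones if a >  b, else 0
  return (a & keep_a) | (b & keep_b);       // exactly one mask is all ones
}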

Example usage of branchless if

This is 64-bit modular addition — it computes (a + b) % n. The branching version (not shown) suffers terribly from branch prediction failures, but the branchless version is very fast (at least with clang).

inline uint64_t uint64_add_mod(uint64_t a, uint64_t b, uint64_t n)
{
  assert(n > 1); assert(a < n); assert(b < n);

  uint64_t c = a + b - BRANCHLESS_IF((a >= n - b), n);

  assert(c < n);
  return c;
}
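
A quick way to sanity-check this is to compare against a widened reference computation. The sketch below is illustrative only (the helper name and operand ranges are mine, and unsigned __int128 is a gcc/clang extension rather than standard C):

#include <assert.h>
#include <stdint.h>

// Test sketch: compare uint64_add_mod() against an unsigned __int128
// reference over a small range of operands just below the modulus.
static void check_add_mod(uint64_t n)
{
  assert(n >= 5);
  for (uint64_t a = n - 5; a < n; a++)
    for (uint64_t b = n - 5; b < n; b++)
    {
      uint64_t expect = (uint64_t)(((unsigned __int128)a + b) % n);
      assert(uint64_add_mod(a, b, n) == expect);
    }
}

For example, check_add_mod(UINT64_MAX) covers operands where the intermediate sum a + b wraps around 2**64, which the unsigned arithmetic handles correctly.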

Update: Full concrete working example of branchless if

Below is a full working C11 program that demonstrates the speed difference between branching and branchless versions of a simple if conditional, in case you would like to try it on your system. The program computes modular exponentiation, that is, (a ** b) % n, for extremely large values.

To compile, use the following on the command line:

  • -O3 (or whatever high optimization level you prefer)
  • -DNDEBUG (to disable assertions, for speed)
  • Either -DBRANCHLESS=0 or -DBRANCHLESS=1 to specify branching or branchless behavior, respectively

On my system, here's what happens:

$ cc -DBRANCHLESS=0 -DNDEBUG -O3 -o powmod powmod.c && ./powmod
BRANCHLESS = 0
CPU time:  21.83 seconds
foo = 10585369126512366091

$ cc -DBRANCHLESS=1 -DNDEBUG -O3 -o powmod powmod.c && ./powmod
BRANCHLESS = 1
CPU time:  11.76 seconds
foo = 10585369126512366091

$ cc --version
Apple LLVM version 6.0 (clang-600.0.57) (based on LLVM 3.5svn)
Target: x86_64-apple-darwin14.1.0
Thread model: posix

So, the branchless version is almost twice as fast as the branching version on my system (3.4 GHz Intel Core i7).

// SPEED TEST OF MODULAR MULTIPLICATION WITH BRANCHLESS CONDITIONALS

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
#include <time.h>
#include <assert.h>

typedef  uint64_t  uint64;

//------------------------------------------------------------------------------
#if BRANCHLESS
  // Actually branchless.
  #define  BRANCHLESS_IF(f,x)          ((x) & -((typeof(x))!!(f)))
  #define  BRANCHLESS_IF_ELSE(f,x,y)  (((x) & -((typeof(x))!!(f))) | \
                                       ((y) & -((typeof(y)) !(f))))
#else
  // Not actually branchless, but used for comparison.
  #define  BRANCHLESS_IF(f,x)          ((f)? (x) : 0)
  #define  BRANCHLESS_IF_ELSE(f,x,y)   ((f)? (x) : (y))
#endif

//------------------------------------------------------------------------------
// 64-bit modular multiplication.  Computes (a * b) % n without division.

static uint64 uint64_mul_mod(uint64 a, uint64 b, const uint64 n)
{
  assert(n > 1); assert(a < n); assert(b < n);

  if (a < b) { uint64 t = a; a = b; b = t; }  // Ensure that b <= a.

  uint64 c = 0;
  for (; b != 0; b /= 2)
  {
    // This computes c = (c + a) % n if (b & 1).
    c += BRANCHLESS_IF((b & 1), a - BRANCHLESS_IF((c >= n - a), n));
    assert(c < n);

    // This computes a = (a + a) % n.
    a += a - BRANCHLESS_IF((a >= n - a), n);
    assert(a < n);
  }

  assert(c < n);
  return c;
}

//------------------------------------------------------------------------------
// 64-bit modular exponentiation.  Computes (a ** b) % n using modular
// multiplication.

static
uint64 uint64_pow_mod(uint64 a, uint64 b, const uint64 n)
{
  assert(n > 1); assert(a < n);

  uint64 c = 1;

  for (; b > 0; b /= 2)
  {
    if (b & 1)
      c = uint64_mul_mod(c, a, n);

    a = uint64_mul_mod(a, a, n);
  }

  assert(c < n);
  return c;
}

//------------------------------------------------------------------------------
int main(const int argc, const char *const argv[const])
{
  printf("BRANCHLESS = %d\n", BRANCHLESS);

  clock_t clock_start = clock();

  #define SHOW_RESULTS 0

  uint64 foo = 0;  // Used in forcing compiler not to throw away results.

  uint64 n = 3, a = 1, b = 1;
  const uint64 iterations = 1000000;
  for (uint64 iteration = 0; iteration < iterations; iteration++)
  {
    uint64 c = uint64_pow_mod(a%n, b, n);

    if (SHOW_RESULTS)
    {
      printf("(%"PRIu64" ** %"PRIu64") %% %"PRIu64" = %"PRIu64"\n",
             a%n, b, n, c);
    }
    else
    {
      foo ^= c;
    }

    n = n * 3 + 1;
    a = a * 5 + 3;
    b = b * 7 + 5;
  }

  clock_t clock_end = clock();
  double elapsed = (double)(clock_end - clock_start) / CLOCKS_PER_SEC;
  printf("CPU time:  %.2f seconds\n", elapsed);

  printf("foo = %"PRIu64"\n", foo);

  return 0;
}

Second update: Intel vs. ARM performance

  • Testing on 32-bit ARM targets (iPhone 3GS/4S, iPad 1/2/3/4, as compiled by Xcode 6.1 with clang) reveals that the branchless “if” here is actually about 2–3 times slower than ternary ?: for the modular exponentiation code in those cases. So it seems that these branchless macros are not a good idea if maximum speed is needed, although they might be useful in rare cases where constant speed is needed.
  • On 64-bit ARM targets (iPhone 6+, iPad 5), the branchless “if” runs the same speed as ternary ?: — again as compiled by Xcode 6.1 with clang.
  • For both Intel and ARM (as compiled by clang), the branchless “if/else” was about twice as slow as ternary ?: for computing min/max.
Asked Aug 08 '15 by Todd Lehman


1 Answer

Sure, this is portable: the ! operator is guaranteed to give either 0 or 1 as a result, and that value is then converted to whatever type is needed by the other operand.
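
A minimal illustration of that guarantee (my example, not code from the question):

#include <stdint.h>

// !(f) and !!(f) are guaranteed to evaluate to exactly 0 or 1 (C11 6.5.3.3).
// Converting that 0/1 to an unsigned type and negating it yields either
// all-zero or all-one bits, which is what makes the mask trick well defined.
uint64_t mask_from_flag(int f)
{
  return -(uint64_t)!!f;   // 0 if f is false, UINT64_MAX if f is true
}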

As others have observed, your if/else version has the disadvantage of evaluating the condition twice, but you already know that, and as long as there is no side effect you are fine.

What surprises me is that you say that this is faster. I would have thought that modern compilers perform that sort of optimization themselves.

Edit: So I tested this with two compilers (gcc and clang) and both settings of the BRANCHLESS configuration.

In fact, provided you don't forget to set -DNDEBUG=1, the BRANCHLESS=0 version with ?: is much better for gcc and does what I would have expected it to do: it basically uses conditional moves to make the loop branchless. In that case clang doesn't find this sort of optimization and emits some conditional jumps.

For the arithmetic version, gcc's performance gets worse. Seeing what it generates, this is not surprising: it really uses imul instructions, and these are slow. clang comes off better here; it has actually optimized the multiplications out of the "arithmetic" version and replaced them with conditional moves.
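
If you want to check which of the two a given build emitted (conditional moves versus imul), the generated assembly shows it directly; something along these lines, where -S writes an assembly listing and the output file name is just an example:

$ cc -DBRANCHLESS=1 -DNDEBUG -O3 -S -o powmod.s powmod.c
$ grep -E 'cmov|imul' powmod.s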

So to summarize: yes, this is portable, but whether it brings a performance improvement or a regression will depend on your compiler, its version, the compile flags you are applying, the capabilities of your processor, and so on.

Answered Oct 19 '22 by Jens Gustedt