Branchless conditionals on integers — fast, but can they be made faster?

I've been experimenting with the following and have noticed that the branchless “if” defined here (now with &-!! replacing *!!) can speed up certain bottleneck code by as much as (almost) 2x on 64-bit Intel targets with clang:

// Produces x if f is true, else 0 if f is false.
#define  BRANCHLESS_IF(f,x)          ((x) & -((typeof(x))!!(f)))

// Produces x if f is true, else y if f is false.
#define  BRANCHLESS_IF_ELSE(f,x,y)  (((x) & -((typeof(x))!!(f))) | \
                                     ((y) & -((typeof(y)) !(f))))

Note that f should be a reasonably simple expression with no side-effects, so that the compiler is able to do its best optimizations.
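
For example (a hypothetical misuse, not taken from the code in this question): BRANCHLESS_IF_ELSE expands f twice, so any side effect in f happens twice.

// Hypothetical misuse: the macro expands (f) twice, so the increment runs
// twice -- and the two unsequenced increments are in fact undefined behaviour.
uint64_t r_bad = BRANCHLESS_IF_ELSE(i++ < limit, a, b);

// Safe: the condition is a simple expression with no side effects.
uint64_t r_ok  = BRANCHLESS_IF_ELSE(i < limit, a, b);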

Performance is highly dependent on CPU and compiler. The branchless ‘if’ performance is excellent with clang; I haven't found any cases yet where the branchless ‘if/else’ is faster, though.

My question is: are these safe and portable as written (meaning guaranteed to give correct results on all targets), and can they be made faster?

Example usage of branchless if/else

These compute 64-bit minimum and maximum.

inline uint64_t uint64_min(uint64_t a, uint64_t b)
{
  return BRANCHLESS_IF_ELSE((a <= b), a, b);
}

inline uint64_t uint64_max(uint64_t a, uint64_t b)
{
  return BRANCHLESS_IF_ELSE((a >= b), a, b);
}
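
To make the masking explicit, here is uint64_min expanded by hand (an illustration only; the local names are mine, not part of the macros above):

inline uint64_t uint64_min_expanded(uint64_t a, uint64_t b)
{
  // !!(a <= b) is exactly 0 or 1; negating it as uint64_t gives a mask of
  // all-zero or all-one bits.
  uint64_t keep_a = -(uint64_t)!!(a <= b);  // all ones if a <= b, else 0
  uint64_t keep_b = -(uint64_t) !(a <= b);  // all ones if a >  b, else 0
  return (a & keep_a) | (b & keep_b);       // exactly one mask is all ones
}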

Example usage of branchless if

This is 64-bit modular addition — it computes (a + b) % n. The branching version (not shown) suffers terribly from branch prediction failures, but the branchless version is very fast (at least with clang).

inline uint64_t uint64_add_mod(uint64_t a, uint64_t b, uint64_t n)
{
  assert(n > 1); assert(a < n); assert(b < n);

  uint64_t c = a + b - BRANCHLESS_IF((a >= n - b), n);

  assert(c < n);
  return c;
}
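
A quick way to sanity-check this is to compare against a widened reference computation. The sketch below is illustrative only (the helper name and operand ranges are mine, and unsigned __int128 is a gcc/clang extension rather than standard C):

#include <assert.h>
#include <stdint.h>

// Test sketch: compare uint64_add_mod() against an unsigned __int128
// reference over a small range of operands just below the modulus.
static void check_add_mod(uint64_t n)
{
  assert(n >= 5);
  for (uint64_t a = n - 5; a < n; a++)
    for (uint64_t b = n - 5; b < n; b++)
    {
      uint64_t expect = (uint64_t)(((unsigned __int128)a + b) % n);
      assert(uint64_add_mod(a, b, n) == expect);
    }
}

For example, check_add_mod(UINT64_MAX) covers operands where the intermediate sum a + b wraps around 2**64, which the unsigned arithmetic handles correctly.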

Update: Full concrete working example of branchless if

Below is a full working C11 program that demonstrates the speed difference between branching and branchless versions of a simple if conditional, in case you would like to try it on your system. The program computes modular exponentiation, that is, (a ** b) % n, for extremely large values.

To compile, use the following on the command line:

  • -O3 (or whatever high optimization level you prefer)
  • -DNDEBUG (to disable assertions, for speed)
  • Either -DBRANCHLESS=0 or -DBRANCHLESS=1 to specify branching or branchless behavior, respectively

On my system, here's what happens:

$ cc -DBRANCHLESS=0 -DNDEBUG -O3 -o powmod powmod.c && ./powmod
BRANCHLESS = 0
CPU time:  21.83 seconds
foo = 10585369126512366091

$ cc -DBRANCHLESS=1 -DNDEBUG -O3 -o powmod powmod.c && ./powmod
BRANCHLESS = 1
CPU time:  11.76 seconds
foo = 10585369126512366091

$ cc --version
Apple LLVM version 6.0 (clang-600.0.57) (based on LLVM 3.5svn)
Target: x86_64-apple-darwin14.1.0
Thread model: posix

So, the branchless version is almost twice as fast as the branching version on my system (3.4 GHz Intel Core i7).

// SPEED TEST OF MODULAR MULTIPLICATION WITH BRANCHLESS CONDITIONALS

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
#include <time.h>
#include <assert.h>

typedef  uint64_t  uint64;

//------------------------------------------------------------------------------
#if BRANCHLESS
  // Actually branchless.
  #define  BRANCHLESS_IF(f,x)          ((x) & -((typeof(x))!!(f)))
  #define  BRANCHLESS_IF_ELSE(f,x,y)  (((x) & -((typeof(x))!!(f))) | \
                                       ((y) & -((typeof(y)) !(f))))
#else
  // Not actually branchless, but used for comparison.
  #define  BRANCHLESS_IF(f,x)          ((f)? (x) : 0)
  #define  BRANCHLESS_IF_ELSE(f,x,y)   ((f)? (x) : (y))
#endif

//------------------------------------------------------------------------------
// 64-bit modular multiplication.  Computes (a * b) % n without division.

static uint64 uint64_mul_mod(uint64 a, uint64 b, const uint64 n)
{
  assert(n > 1); assert(a < n); assert(b < n);

  if (a < b) { uint64 t = a; a = b; b = t; }  // Ensure that b <= a.

  uint64 c = 0;
  for (; b != 0; b /= 2)
  {
    // This computes c = (c + a) % n if (b & 1).
    c += BRANCHLESS_IF((b & 1), a - BRANCHLESS_IF((c >= n - a), n));
    assert(c < n);

    // This computes a = (a + a) % n.
    a += a - BRANCHLESS_IF((a >= n - a), n);
    assert(a < n);
  }

  assert(c < n);
  return c;
}

//------------------------------------------------------------------------------
// 64-bit modular exponentiation.  Computes (a ** b) % n using modular
// multiplication.

static
uint64 uint64_pow_mod(uint64 a, uint64 b, const uint64 n)
{
  assert(n > 1); assert(a < n);

  uint64 c = 1;

  for (; b > 0; b /= 2)
  {
    if (b & 1)
      c = uint64_mul_mod(c, a, n);

    a = uint64_mul_mod(a, a, n);
  }

  assert(c < n);
  return c;
}

//------------------------------------------------------------------------------
int main(const int argc, const char *const argv[const])
{
  printf("BRANCHLESS = %d\n", BRANCHLESS);

  clock_t clock_start = clock();

  #define SHOW_RESULTS 0

  uint64 foo = 0;  // Used in forcing compiler not to throw away results.

  uint64 n = 3, a = 1, b = 1;
  const uint64 iterations = 1000000;
  for (uint64 iteration = 0; iteration < iterations; iteration++)
  {
    uint64 c = uint64_pow_mod(a%n, b, n);

    if (SHOW_RESULTS)
    {
      printf("(%"PRIu64" ** %"PRIu64") %% %"PRIu64" = %"PRIu64"\n",
             a%n, b, n, c);
    }
    else
    {
      foo ^= c;
    }

    n = n * 3 + 1;
    a = a * 5 + 3;
    b = b * 7 + 5;
  }

  clock_t clock_end = clock();
  double elapsed = (double)(clock_end - clock_start) / CLOCKS_PER_SEC;
  printf("CPU time:  %.2f seconds\n", elapsed);

  printf("foo = %"PRIu64"\n", foo);

  return 0;
}

Second update: Intel vs. ARM performance

  • Testing on 32-bit ARM targets (iPhone 3GS/4S, iPad 1/2/3/4, as compiled by Xcode 6.1 with clang) reveals that the branchless “if” here is actually about 2–3 times slower than ternary ?: for the modular exponentiation code in those cases. So it seems that these branchless macros are not a good idea if maximum speed is needed, although they might be useful in rare cases where constant speed is needed.
  • On 64-bit ARM targets (iPhone 6+, iPad 5), the branchless “if” runs the same speed as ternary ?: — again as compiled by Xcode 6.1 with clang.
  • For both Intel and ARM (as compiled by clang), the branchless “if/else” was about twice as slow as ternary ?: for computing min/max.
Asked Aug 08 '15 by Todd Lehman


1 Answer

Sure, this is portable: the ! operator is guaranteed to give either 0 or 1 as a result, and that value is then converted to whatever type is needed by the other operand.
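
A minimal illustration of that guarantee (my example, not code from the question):

#include <stdint.h>

// !(f) and !!(f) are guaranteed to evaluate to exactly 0 or 1 (C11 6.5.3.3).
// Converting that 0/1 to an unsigned type and negating it yields either
// all-zero or all-one bits, which is what makes the mask trick well defined.
uint64_t mask_from_flag(int f)
{
  return -(uint64_t)!!f;   // 0 if f is false, UINT64_MAX if f is true
}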

As others have observed, your if/else version has the disadvantage of evaluating the condition twice, but you already know that, and as long as there is no side effect you are fine.

What surprises me is that you say that this is faster. I would have thought that modern compilers perform that sort of optimization themselves.

Edit: So I tested this with two compilers (gcc and clang) and both settings of the BRANCHLESS configuration.

In fact, provided you don't forget to set -DNDEBUG=1, the BRANCHLESS=0 version with ?: is much better for gcc and does what I would have expected it to do: it basically uses conditional moves to make the loop branchless. In that case clang doesn't find this sort of optimization and emits some conditional jumps.

For the arithmetic version, gcc's performance gets worse. Seeing what it generates, this is not surprising: it really uses imul instructions, and these are slow. clang comes off better here; it has actually optimized the multiplications out of the "arithmetic" version and replaced them with conditional moves.
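
If you want to check which of the two a given build emitted (conditional moves versus imul), the generated assembly shows it directly; something along these lines, where -S writes an assembly listing and the output file name is just an example:

$ cc -DBRANCHLESS=1 -DNDEBUG -O3 -S -o powmod.s powmod.c
$ grep -E 'cmov|imul' powmod.s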

So to summarize: yes, this is portable, but whether it brings a performance improvement or a regression will depend on your compiler, its version, the compile flags you are applying, the capabilities of your processor, and so on.

Answered Oct 19 '22 by Jens Gustedt