I realize I've got a great lack of knowledge in that area (fancy way of saying I don't know jack about it). Is there some documentation as to how and when to use them?

Apart from all the twiddling based branchless code (which won't cover everything, such as FP), you get instructions specifically meant for branchless code creation, these would be <code>SETcc</code>, <code>FCMOVcc</code> and <code>CMOVcc</code> under x86, which perform operations based on the condition flags from a comparison. a really simple example would be (yes, the example is so simple that one would probably never write something like this, its only to demonstrated a point clearly): <pre class="prettyprint"><code>bool CheckZero(int x) { if(x == 0) return true; return false; //OR we can use: return (x == 0) ? true : false; } </code></pre> now a simple x86 compiler might compile that down to: <pre class="prettyprint"><code> MOV EAX,[ESP + 4] TEXT EAX,EAX JE _TRUE XOR EAX,EAX RETN 4 _TRUE: MOV EAX,1 RETN 4 </code></pre> an optimizing x86 compiler would take it down into the following branchless code (or similar): <pre class="prettyprint"><code>MOV ECX,[ESP + 4] XOR EAX,EAX TEST ECX,ECX SETE AL RETN 4 </code></pre> A little more complex example can be found here. However, this is something that a compiler will perform, and not some you should worry about yourself (at least not without analyzing the output of your compiler). but if the code is required to be branchless without fail, C++ doesn't provide enough control to do so, so you need to use (inline) assembly.

I had written ternary logic simulator not so long ago, and this question was viable to me, as it directly affects my interpretator execution speed; I was required to simulate tons and tons of ternary logic gates as fast as possible. In a binary-coded-ternary system one trit is packed in two bits. Most significant bit means negative and least significant means positive one. Case "11" should not occur, but it must be handled properly and threated as 0. Consider <code>inline int bct_decoder( unsigned bctData )</code> function, which should return our formatted trit as regular integer -1, 0 or 1; As i observed there are 4 approaches: i called them "cond", "mod", "math" and "lut"; Lets investigate them First is based on jz|jnz and jl|jb conditional jumps, thus cond. Its performance is not good at all, because relies on a branch predictor. And even worse - it varies, because it is unknown if there will be one branch or two a priori. And here is an example: <pre class="prettyprint"><code>inline int bct_decoder_cond( unsigned bctData ) { unsigned lsB = bctData & 1; unsigned msB = bctData >> 1; return ( lsB == msB ) ? 0 : // most possible -> make zero fastest branch ( lsB > msB ) ? 1 : -1; } </code></pre> This is slowest version, it could involve 2 branches in worst case and this is something where binary logic fails. On my 3770k it prodices around 200MIPS on average on random data. (here and after - each test is average from 1000 tries on randomly filled 2mb dataset) Next one relies on modulo operator and its speed is somewhere in between first and third, but is definetely faster - 600 MIPS: <pre class="prettyprint"><code>inline int bct_decoder_mod( unsigned bctData ) { return ( int )( ( bctData + 1 ) % 3 ) - 1; } </code></pre> Next one is branchless approach, which involves only maths, thus math; it does not assume jump instrunctions at all: <pre class="prettyprint"><code>inline int bct_decoder_math( unsigned bctData ) { return ( int )( bctData & 1 ) - ( int )( bctData >> 1 ); } </code></pre> This does what is should, and behaves really great. To compare, performance estimate is 1000 MIPS, and it is 5x faster than branched version. Probably branched version is slowed down due to lack of native 2-bit signed int support. But in my application it is quite good version in itself. If this is not enough then we can go futher, having something special. Next is called lookup table approach: <pre class="prettyprint"><code>inline int bct_decoder_lut( unsigned bctData ) { static const int decoderLUT[] = { 0, 1, -1, 0 }; return decoderLUT[ bctData & 0x3 ]; } </code></pre> In my case one trit occupied only 2 bits, so lut table was only 2b*4 = 8 bytes, and was worth trying. It fits in cache and works blazing fast at 1400-1600 MIPS, here is where my measurement accuracy is going down. And that is is 1.5x speedup from fast math approach. That's because you just have precalculated result and single <code>AND</code> instruction. Sadly caches are small and (if your index length is greater than several bits) you simply cannot use it. So i think i answered your question, on what what could branched/branchless code be like. Answer is much better and with detailed samples, real world application and real performance measurements results.

What does a branchless if in C++ look like?

2 Answers

Apart from all the twiddling based branchless code (which won't cover everything, such as FP), you get instructions specifically meant for branchless code creation, these would be SETcc, FCMOVcc and CMOVcc under x86, which perform operations based on the condition flags from a comparison.

a really simple example would be (yes, the example is so simple that one would probably never write something like this, its only to demonstrated a point clearly):

bool CheckZero(int x)
{
    if(x == 0)
       return true;

    return false;
    //OR we can use: return (x == 0) ? true : false; 
}

now a simple x86 compiler might compile that down to:

    MOV EAX,[ESP + 4]
    TEXT EAX,EAX
    JE _TRUE
    XOR EAX,EAX
    RETN 4

_TRUE:
    MOV EAX,1
    RETN 4

an optimizing x86 compiler would take it down into the following branchless code (or similar):

MOV ECX,[ESP + 4]
XOR EAX,EAX
TEST ECX,ECX
SETE AL
RETN 4

A little more complex example can be found here.

However, this is something that a compiler will perform, and not some you should worry about yourself (at least not without analyzing the output of your compiler). but if the code is required to be branchless without fail, C++ doesn't provide enough control to do so, so you need to use (inline) assembly.

120

answered Nov 03 '22 11:11

Necrolis

I had written ternary logic simulator not so long ago, and this question was viable to me, as it directly affects my interpretator execution speed; I was required to simulate tons and tons of ternary logic gates as fast as possible.

In a binary-coded-ternary system one trit is packed in two bits. Most significant bit means negative and least significant means positive one. Case "11" should not occur, but it must be handled properly and threated as 0.

Consider inline int bct_decoder( unsigned bctData ) function, which should return our formatted trit as regular integer -1, 0 or 1; As i observed there are 4 approaches: i called them "cond", "mod", "math" and "lut"; Lets investigate them

First is based on jz|jnz and jl|jb conditional jumps, thus cond. Its performance is not good at all, because relies on a branch predictor. And even worse - it varies, because it is unknown if there will be one branch or two a priori. And here is an example:

inline int bct_decoder_cond( unsigned bctData ) {
   unsigned lsB = bctData & 1;
   unsigned msB = bctData >> 1;
   return
       ( lsB == msB ) ? 0 : // most possible -> make zero fastest branch
       ( lsB > msB ) ? 1 : -1;
}

This is slowest version, it could involve 2 branches in worst case and this is something where binary logic fails. On my 3770k it prodices around 200MIPS on average on random data. (here and after - each test is average from 1000 tries on randomly filled 2mb dataset)

Next one relies on modulo operator and its speed is somewhere in between first and third, but is definetely faster - 600 MIPS:

inline int bct_decoder_mod( unsigned bctData ) {
    return ( int )( ( bctData + 1 ) % 3 ) - 1;
}

Next one is branchless approach, which involves only maths, thus math; it does not assume jump instrunctions at all:

inline int bct_decoder_math( unsigned bctData ) {
    return ( int )( bctData & 1 ) - ( int )( bctData >> 1 );
}

This does what is should, and behaves really great. To compare, performance estimate is 1000 MIPS, and it is 5x faster than branched version. Probably branched version is slowed down due to lack of native 2-bit signed int support. But in my application it is quite good version in itself.

If this is not enough then we can go futher, having something special. Next is called lookup table approach:

inline int bct_decoder_lut( unsigned bctData ) {
    static const int decoderLUT[] = { 0, 1, -1, 0 };
    return decoderLUT[ bctData & 0x3 ];
}

In my case one trit occupied only 2 bits, so lut table was only 2b*4 = 8 bytes, and was worth trying. It fits in cache and works blazing fast at 1400-1600 MIPS, here is where my measurement accuracy is going down. And that is is 1.5x speedup from fast math approach. That's because you just have precalculated result and single AND instruction. Sadly caches are small and (if your index length is greater than several bits) you simply cannot use it.

So i think i answered your question, on what what could branched/branchless code be like. Answer is much better and with detailed samples, real world application and real performance measurements results.

answered Nov 03 '22 09:11

xakepp35

Related questions
                            
                                What is the importance of making a variable a constant? [closed]
                            
                                Accessing struct data members via pointer arithmetic
                            
                                Missing msvcr80.dll
                            
                                How do I create a sqllite3 in-memory database?
                            
                                why would std::string s("??<") output a { instead of ??< as expected?
                            
                                How to reinterpret the bits of a float as an int
                            
                                Common Uses For Pointers?
                            
                                Socket listen doesn't unbind in C++ under linux
                            
                                What happens if I use "throw;" without an exception to throw?
                            
                                Register C function in Lua table
                            
                                Use the keyword class as a variable name in C++
                            
                                Dealing with char buffers
                            
                                How Visitor Pattern avoid downcasting
                            
                                How to add valid key without specifying value to a std::map?
                            
                                Detect when a Module (DLL) is unloaded
                            
                                Producing random float from negative to positive range?
                            
                                How does iterator category in C++ work?
                            
                                A newbie question in Qt
                            
                                Shall I use cerr
                            
                                Algorithm to determine coin combinations

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

What does a branchless if in C++ look like?

Tags:

c++

branch

if-statement

MPelletier

People also ask

2 Answers

Necrolis

xakepp35

Recent Activity

Donate For Us