
Branch prediction and branch target prediction optimization

My code makes frequent calls to a function with multiple (unpredictable) branches. When I profiled, I found that it is a minor bottleneck, with the majority of CPU time used on the conditional JMPs.

Consider the following two functions, where the original has multiple explicit branches.

void branch_example_original(void* mem, size_t s)
{
    if(!(s & 7)) {
        /* logic in _process_mem_64 inlined */
    }
    else if(!(s & 3)) {
        /* logic in _process_mem_32 inlined */
    }
    else if(!(s & 1)) {
        /* logic in _process_mem_16 inlined */
    }
    else {
        /* logic in _process_mem_8 inlined */
    }
}

Here is the new function, where I attempted to remove branches causing the bottleneck.

void branch_example_new(void* mem, size_t s)
{
    /* assumed: typedef void (*fprocess_mem)(void*, size_t); */
    const fprocess_mem mem_funcs[] = {_process_mem_8, _process_mem_16, _process_mem_32, _process_mem_64};
    const uint32_t magic = 3 - !!(s & 7) - !!(s & 3) - !!(s & 1);
    mem_funcs[magic](mem, s >> magic);
}

However, when I profiled the new code, the performance increased by only ~20%, and the CALL itself (to a func in the mem_funcs array) took a very long time.

Is the second variation simply a more implicit conditional, as the CPU still cannot predict the function that will be called? Am I correct in assuming that this has to do with branch target prediction?

Why does this happen, and are there other solutions to this?

Edit:

Thanks for the ideas, but I'd like an explanation of why this happens as well.

asked Aug 29 '15 by frank90


People also ask

What is the difference between branch target buffer and branch prediction buffer?

A branch prediction buffer contains a prediction about whether the next branch will be taken (T) or not (NT), but it does not supply the target PC value; a Branch Target Buffer (BTB) does.

What are the two types of branch prediction techniques available?

Branch prediction techniques come in two types: static branch prediction and dynamic branch prediction.

What is branch prediction technique for pipeline optimization?

Branch prediction attempts to guess whether a conditional jump will be taken or not. Branch target prediction attempts to guess the target of a taken conditional or unconditional jump before it is computed by decoding and executing the instruction itself.

What is branch in branch prediction?

Branch prediction is an approach to computer architecture that attempts to mitigate the costs of branching: a pipelined CPU guesses the outcome of a branch so it can keep fetching and executing instructions without waiting. (Branch predication is a different, similarly named technique that avoids branches entirely by executing instructions only when an attached predicate is true.)
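
For example (a sketch; whether a conditional move is actually emitted depends on the compiler and optimization level), a branchless select in C:

int select_max(int a, int b)
{
    /* Compilers commonly turn this ternary into a conditional move
       (cmov on x86) rather than a branch, i.e. a predicated operation. */
    return (a > b) ? a : b;
}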


2 Answers

Is the second variation simply a more implicit conditional, as the CPU still cannot predict the function that will be called? Am I correct in assuming that this has to do with branch target prediction?

Yes, unconditional indirect branches require a branch-target-buffer hit for the CPU to figure out where to fetch code from next. Modern CPUs are heavily pipelined, and need to be fetching code well ahead of where they're executing if they're going to avoid bubbles in the pipe where they don't have anything to do. Having to wait until magic is calculated is far too late to avoid an instruction fetch bubble. Performance counters will show BTB misses as a branch mispredict, I think.

As I suggested in a comment, if you can, you should restructure your code to do a scalar intro and cleanup around a vectorized loop (sketched below). The intro handles elements up until you reach an aligned element. The cleanup loop handles cases where there's a non-zero number of elements left to process after the last full vector. Then you're not stuck doing a scalar loop just because the size or alignment of the first element wasn't ideal.
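
For example (an illustrative sketch of mine, assuming SSE2; the double-every-byte body is a stand-in for whatever _process_mem_* actually does), the structure looks like:

#include <emmintrin.h>   // SSE2
#include <stdint.h>
#include <stddef.h>

void double_bytes(uint8_t* p, size_t n)
{
    uint8_t* end = p + n;
    // scalar intro: process until p is 16B-aligned (or we run out of data)
    for ( ; p < end && ((uintptr_t)p & 0xf); p++)
        *p = (uint8_t)(*p * 2);
    // main loop: one aligned 16-byte vector per iteration
    for ( ; p + 16 <= end; p += 16) {
        __m128i v = _mm_load_si128((const __m128i*)p);
        v = _mm_add_epi8(v, v);             // x*2 == x+x for each byte
        _mm_store_si128((__m128i*)p, v);
    }
    // scalar cleanup: the final 0..15 bytes
    for ( ; p < end; p++)
        *p = (uint8_t)(*p * 2);
}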


Depending on what you're processing, if it's OK to repeat work and overlap, then you can make a branchless startup that does an unaligned chunk, then does the rest aligned. Some libraries probably implement memset something like this:

// not shown: check that count >= 16
char* endp = dest + count;
unaligned_store_16B( dest );    // e.g. x86 movdqu
dest += 16;
dest = (char*)((uintptr_t)dest & ~(uintptr_t)0xf);  // align down to 16; the first aligned write overlaps the unaligned one by up to 15B
for ( ; dest < endp-15 ; dest+=16) {
    aligned_store_16B( dest );  // e.g. x86 movdqa
}
// handle the last up-to-15 bytes from dest to endp similarly.

This makes handling the unaligned start of the loop branchless, because you don't care how much the unaligned start overlapped.

Note that most one-buffer functions aren't repeatable, though: e.g. in-place a[i] *= 2, or sum += a[i], must avoid processing the same input twice, usually by running a scalar loop until you get to an aligned address. a[i] &= 0x7f, or maxval = max(a[i], maxval), are exceptions, though (as the sketch below shows).
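
For instance (my own SSE2 sketch, in place, requiring n >= 16): applying the mask twice is a no-op, so the overlapping start is safe:

#include <emmintrin.h>
#include <stdint.h>
#include <stddef.h>

void mask_bytes(uint8_t* p, size_t n)   /* requires n >= 16 */
{
    uint8_t* end = p + n;
    const __m128i mask = _mm_set1_epi8(0x7f);
    // unaligned first vector; some of these bytes may be masked again below,
    // which is harmless: (x & 0x7f) & 0x7f == x & 0x7f
    _mm_storeu_si128((__m128i*)p,
                     _mm_and_si128(_mm_loadu_si128((const __m128i*)p), mask));
    p = (uint8_t*)(((uintptr_t)p + 16) & ~(uintptr_t)0xf);  // first aligned address, overlapping by up to 15B
    for ( ; p + 16 <= end; p += 16)
        _mm_store_si128((__m128i*)p,
                        _mm_and_si128(_mm_load_si128((const __m128i*)p), mask));
    for ( ; p < end; p++)   // scalar tail for the last 0..15 bytes
        *p &= 0x7f;
}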


Functions with two independent pointers that can be misaligned by different amounts are trickier. You have to be careful not to change their relative offset with masking. memcpy is the simplest example of a function that processes data from a src to a dest buffer: memcpy has to work even if (src+3) % 16 == 0 and (dest+7) % 16 == 0. Unless you can put constraints on the callers, the best you can do in general is have either every load or every store aligned in the main loop.

On x86, the unaligned move instructions (movdqu and friends) are just as fast as the alignment-required version when the address is aligned. So you don't need a separate version of the loop for the special case when src and dest have the same (mis)alignment, and the loads and stores can both be aligned. IIRC, this is true for Intel Nehalem and newer CPUs, and for recent AMD.

// check count >= 16
char* endp = dest + count;
unaligned_copy_16B( dest, src );  // load with movdqu, store with movdqu
// src+=16; dest+=16;  // combined with aligning dest, below

uintptr_t dest_misalign = (uintptr_t)dest & 0xf;  // bytes by which the first aligned iteration will overlap
src  += 16 - dest_misalign;  // src potentially still misaligned
dest += 16 - dest_misalign;  // dest now 16B-aligned

for ( ; dest <= endp-16 ; src+=16, dest+=16) {
    tmpvec = unaligned_load_16B( src ); // x86 movdqu is fast if src happens to be aligned
    aligned_store_16B( dest, tmpvec );  // x86 movdqa
}
// handle the last bytes, from dest to endp, similarly.
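
The same loop rendered with SSE2 intrinsics (a sketch of mine: the helper names above correspond to _mm_loadu_si128 / _mm_storeu_si128 / _mm_store_si128, and the tail is handled with one final overlapping unaligned copy):

#include <emmintrin.h>   // SSE2
#include <stdint.h>
#include <stddef.h>

/* Assumes count >= 16 and non-overlapping buffers. */
void copy_align_dest(char* dest, const char* src, size_t count)
{
    char* endp = dest + count;
    // first vector: unaligned load + unaligned store (movdqu)
    _mm_storeu_si128((__m128i*)dest, _mm_loadu_si128((const __m128i*)src));

    size_t adjust = 16 - ((uintptr_t)dest & 0xf);  // 16 if dest was already aligned
    src  += adjust;                                // src may still be misaligned
    dest += adjust;                                // dest is now 16B-aligned

    for ( ; dest <= endp - 16; src += 16, dest += 16) {
        __m128i v = _mm_loadu_si128((const __m128i*)src);  // movdqu
        _mm_store_si128((__m128i*)dest, v);                // movdqa
    }
    // tail: one final unaligned 16B copy ending exactly at endp;
    // it may rewrite bytes already copied, with identical data
    if (dest < endp)
        _mm_storeu_si128((__m128i*)(endp - 16),
                         _mm_loadu_si128((const __m128i*)(src + (endp - 16 - dest))));
}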

An aligned dest is probably more likely than an aligned source. No overlapping repeated work happens when the pointer we align is already aligned.

If you aren't doing memcpy, it can be an advantage to have src aligned so the load can fold into another instruction as a memory operand. This saves an instruction, and since the folded load can micro-fuse on Intel CPUs, it often saves an internal uop as well.
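
For instance (a sketch; whether the fold actually happens is up to the compiler), with a guaranteed-aligned src:

#include <emmintrin.h>

/* With src known 16B-aligned, compilers can fold the load into the ALU op,
 * e.g. paddb xmm0, [rdi], instead of a separate movdqu + paddb. */
__m128i add_block(__m128i acc, const __m128i* src /* assumed aligned */)
{
    return _mm_add_epi8(acc, _mm_load_si128(src));
}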

For the case where src and dest have different alignments, I haven't tested whether it's faster to do aligned loads and unaligned stores, or the other way around. I picked aligned stores because of potential store->load forwarding benefits for short buffers. If the dest buffer is aligned, and only a couple vectors long, and will be read again right away, then aligned loads from dest will stall for ~10 cycles (Intel SnB) if the load crosses a boundary between two preceding stores that haven't made it to L1 cache yet. (i.e. store forwarding fails). See http://agner.org/optimize/ for info on low-level details like this (esp. the microarch guide.)

Store forwarding from memcpy to loads in the next loop is only going to happen if the buffers are small (maybe up to 64B?), or if your next loop starts reading from the end of the buffer (which will still be in cache even if the beginning has already been evicted). Otherwise, the stores to the start of the buffer will have made it from a store buffer to L1, so store-forwarding won't come into play.

It's possible that for large buffers with different alignments, aligned loads and unaligned stores will do better. I'm just making stuff up here, but this could be true if unaligned stores can retire quickly even when they cross a cache-line or page boundary. Of course unaligned loads can't retire until the data is actually loaded. With more load/store instructions in flight, there's less chance of a cache miss stalling things. (You're potentially taking advantage of more of the CPU's load/store buffers.) Again, pure speculation. I tried to google whether unaligned stores were better or worse than unaligned loads, but just got hits about how to do them, and misalignment penalties that apply to both.

answered by Peter Cordes


You could try something like this:

switch(s & 7) {
case 0:         /* divisible by 8 */
    /* _process_mem_64 */
    break;
case 1:
case 3:
case 5:
case 7:         /* odd */
    /* _process_mem_8 */
    break;
case 2:
case 6:         /* divisible by 2 but not by 4 */
    /* _process_mem_16 */
    break;
case 4:         /* divisible by 4 but not by 8 */
    /* _process_mem_32 */
    break;
}

This involves only a single indirect jump through a jump table, and (with the processing logic inlined into each case) does not require a call instruction.

answered by nneonneo