Branchless internal merge slower than internal merge with branch

I recently asked a question on Code Review to review a sorting algorithm named QuickMergeSort. I won't go into the details, but at some point the algorithm performs an internal mergesort: instead of using additional memory to store the data to merge, it swaps the elements to merge with elements from another part of the original sequence, which isn't otherwise involved in the merge. Here is the part of the algorithm I'm concerned with: the function that performs the merging:

template<
    typename InputIterator1,
    typename InputIterator2,
    typename OutputIterator,
    typename Compare = std::less<>
>
auto half_inplace_merge(InputIterator1 first1, InputIterator1 last1,
                        InputIterator2 first2, InputIterator2 last2,
                        OutputIterator result, Compare compare={})
    -> void
{
    for (; first1 != last1; ++result) {
        if (first2 == last2) {
            std::swap_ranges(first1, last1, result);
            return;
        }

        if (compare(*first2, *first1)) {
            std::iter_swap(result, first2);
            ++first2;
        } else {
            std::iter_swap(result, first1);
            ++first1;
        }
    }
    // first2 through last2 are already in the right spot
}
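
For illustration, here is a minimal call sketch showing the layout the function relies on: the first run has already been swapped out into a displaced region, and the output region directly precedes the second run, so that result catches up with first2 exactly when the first run is exhausted (which is what the closing comment relies on).

#include <cassert>
#include <vector>
// assumes the half_inplace_merge shown above is visible in this translation unit

int main()
{
    // Layout: [run1 (displaced) | output region | run2]
    //          1, 3             | 9, 9          | 2, 4
    std::vector<int> v = { 1, 3, 9, 9, 2, 4 };

    half_inplace_merge(v.begin(),     v.begin() + 2,   // run1
                       v.begin() + 4, v.end(),         // run2
                       v.begin() + 2);                 // output region

    // The merged run now occupies the last four positions, and the two
    // former "output" values have been swapped to the front.
    assert((v == std::vector<int>{ 9, 9, 1, 2, 3, 4 }));
}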

That function was adapted from the eponymous function in the libc++ implementation of std::inplace_merge; this new version swaps elements with another part of the original array instead of moving elements from the auxiliary array.

Since the merge is internal, I realized that I didn't actually need to have two separate input types: InputIterator1 and InputIterator2 are always the same. Then I came to realize that, since the operations on first1 and first2 were always the same, I could store them in a two-element array and use the result of the comparison to index the array to know which iterator to swap and to increment. With that small trick, I get rid of the branch and obtain a mostly branchless merge algorithm:

template<
    typename InputIterator,
    typename OutputIterator,
    typename Compare = std::less<>
>
auto half_inplace_merge(InputIterator first1, InputIterator last1,
                        InputIterator first2, InputIterator last2,
                        OutputIterator result, Compare compare={})
    -> void
{
    InputIterator store[] = { first1, first2 };

    for (; store[0] != last1; ++result) {
        if (store[1] == last2) {
            std::swap_ranges(store[0], last1, result);
            return;
        }

        bool cmp = compare(*store[1], *store[0]);
        std::iter_swap(result, store[cmp]);
        ++store[cmp];
    }
    // first2 through last2 are already in the right spot
}

Now, the thing is: with this new half_inplace_merge function, the overall sorting algorithm is 1.5 times slower than with the original half_inplace_merge, and I've got no idea why. I've tried several compiler optimization levels, several tricks to avoid potential aliasing problems, but it seems that the problem comes from the branchless trick itself.

So, is anybody able to explain why the branchless code is slower?


Addendum: for those who want to run the same benchmark as I did... well, it will be a bit difficult: I used the benchmarks from a personal library, which include many things; you'll need to download the library, to add this file somewhere, and to run this benchmark after having added the required line to invoke quick_merge_sort near the highlighted section (you will need to redirect the standard output of the program to a file in a profiles subdirectory). Then you'll need to run this Python script to see the results, adding quick_merge_sort to the highlighted line. Note that NumPy and matplotlib need to be installed.
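
For a rough standalone comparison that doesn't require the library, here is a minimal timing sketch; the names half_inplace_merge_branchy and half_inplace_merge_branchless are assumed renamings of the two versions above, so that both can live in one translation unit.

#include <algorithm>
#include <chrono>
#include <cstddef>
#include <functional>
#include <iostream>
#include <random>
#include <vector>

template<typename Merge>
double time_merge(Merge merge, std::size_t n, int repetitions)
{
    std::mt19937 rng(42);
    std::uniform_int_distribution<int> dist(0, 1'000'000);
    double total_ms = 0.0;

    for (int rep = 0; rep < repetitions; ++rep) {
        // Layout expected by the function: [run1 | output | run2],
        // where the output region has the same size as run1.
        std::vector<int> v(3 * n);
        std::generate(v.begin(), v.end(), [&] { return dist(rng); });
        std::sort(v.begin(), v.begin() + n);        // run1, displaced
        std::sort(v.begin() + 2 * n, v.end());      // run2, in place

        auto start = std::chrono::steady_clock::now();
        merge(v.begin(), v.begin() + n,             // run1
              v.begin() + 2 * n, v.end(),           // run2
              v.begin() + n, std::less<>{});        // output region
        auto stop = std::chrono::steady_clock::now();
        total_ms += std::chrono::duration<double, std::milli>(stop - start).count();
    }
    return total_ms;
}

int main()
{
    constexpr std::size_t n = 1'000'000;
    std::cout << "branchy:    " << time_merge([](auto... args) {
        half_inplace_merge_branchy(args...); }, n, 50) << " ms\n";
    std::cout << "branchless: " << time_merge([](auto... args) {
        half_inplace_merge_branchless(args...); }, n, 50) << " ms\n";
}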

asked Dec 13 '16 by Morwenn

2 Answers

Such a large difference is the product of two conditions.

The first condition relates to the original code. The in-place merge is so efficient that it would be difficult to devise anything significantly faster, even when coding by hand at the assembly level. The use of generics is straightforward, so the compiler** produced the same assembly with or without them. Because the algorithm's implementation is already efficient, adding only a few machine instructions to the loop is enough to produce the significant proportional change noted in the question.

** Compilation specifics throughout this answer refer to g++ 6.2.1 20160916 (the default Fedora 24 dnf package) on Linux kernel 4.8.8-200.fc24.x86_64. The runtime platform was an Intel i7-2600 with 8 MB cache. The code was also compiled for an Atmel SAM3X8E ARM Cortex-M3 with arm-none-eabi-g++ 4.8.3-2014q1.

The second condition relates to the compilation of the second trick, described in the second sentence of the question's third paragraph. The first trick, reducing the number of iterator types in the template, did not produce any significant change in the generated assembly. The second trick produced performance-affecting assembly-level differences in the compiler output for the two calls.

This preprocessor hack can ease testing by switching between the two versions.

#ifdef ORIG
#define half_inplace_merge half_inplace_merge_orig
#else // ORIG
#define half_inplace_merge half_inplace_merge_slow
#endif // ORIG
...
half_inplace_merge(niInA.begin(), niInA.end(),
        niInB.begin(), niInB.end(),
        niOut.begin(), compare);

These commands, run in a bash shell, exploit the preprocessor hack to compile both versions and compare their assembly (any definition other than ORIG, such as -DSLOW, selects the slow version).

g++ -DORIG -S -fverbose-asm -o /tmp/qq.orig.s /tmp/qq.cpp
g++ -DSLOW -S -fverbose-asm -o /tmp/qq.slow.s /tmp/qq.cpp
araxis.sh /tmp/qq.orig.s /tmp/qq.slow.s  # to run Araxis Merge in Wine

These instructions result from initializing the InputIterator store[ ] array; note that they sit outside the loop.

leaq    -48(%rbp), %rax #, _4
movq    -64(%rbp), %rdx # first1, tmp104
movq    %rdx, (%rax)    # tmp104, *_5
leaq    8(%rax), %rdx   #, _9
movq    -96(%rbp), %rax # first2, tmp105
movq    %rax, (%rdx)    # tmp105, *_9

The primary slowdown comes from dereferencing the two items contained in store[ ], as needed by the compare and the swap, and that happens inside the loop. These instructions do not exist in the version without the second trick.

movb    %al, -17(%rbp)  # _27, cmp
movzbl  -17(%rbp), %eax # cmp, _29
cltq
...
movzbl  -17(%rbp), %edx # cmp, _31
leaq    -48(%rbp), %rax #, tmp121
movslq  %edx, %rdx  # _31, tmp122
salq    $3, %rdx    #, tmp123
addq    %rdx, %rax  # tmp123, _32

Although the version without the trick duplicates code in the bodies of the conditional, that only impacts the compactness of the code, adding two calls, five moves, and one compare instruction. The number of CPU cycles required to perform the in-place merge is the same for either branch resulting from the compare, and both lack the instructions listed above.

For each of the several syntax permutations tried, removing the redundancy in the branches to improve compactness inescapably adds instructions along the execution path.

The details of the instruction sequences for the various permutations discussed here vary from compiler to compiler, with the optimization options selected, and even with the conditions under which the functions are called.

It is theoretically possible for a compiler to employ an AST (abstract syntax tree) refactoring rule (or the equivalent) to detect and reduce both program memory and CPU cycle requirements for either version of the function. Such rules have antecedents (search patterns) that match the pattern to be optimized within the code.

Optimizing speed for the code with the second trick would require a rule antecedent that matches the atypical store[ ] abstraction both inside and outside of the loop. Detecting the branch redundancy without the second trick is a more reasonable goal.

Integrating the two statements within each branch, one can see how the two like patterns in the AST might be simple enough for a refactoring rule antecedent to match and perform the desired code size reduction. There would be very little gain in speed for this case, if any.

if (compare(*first2, *first1)) {
    std::iter_swap(result, first2 ++);
} else {
    std::iter_swap(result, first1 ++);
}
answered Nov 01 '22 by Douglas Daseeco


The following is just a short intuitive explanation:

If we strip everything away and assume that the iterators are plain pointers, then in the first example all of the iterators can be kept in registers.

In the branchless code we cannot easily do that, because of store[cmp] and ++store[cmp], and that implies an overhead for every use of store[0] and store[1].

Thus (in this case) maximizing the use of registers is more important than avoiding branches.
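
As a contrived sketch of why that is (the function names below are invented for illustration): a runtime-indexed array generally has to be addressable, so with little or no optimization every use of store[cmp] turns into a memory round trip, whereas two named locals can stay in registers.

// Contrived sketch, not from the answer above: a runtime-indexed array must be
// addressable, so it typically gets a stack slot and each access becomes a
// load/modify/store, while separate named pointers can live in registers.
void advance_array(int*& a, int*& b, bool cmp)
{
    int* store[] = { a, b };   // needs a memory location to be indexed
    ++store[cmp];              // load the pointer, increment, write it back
    a = store[0];
    b = store[1];
}

void advance_branch(int*& a, int*& b, bool cmp)
{
    if (cmp) { ++b; } else { ++a; }   // both pointers can stay in registers
}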

answered Nov 02 '22 by Hans Olsson