Why aren't std::count and std::find optimised to use memchr?

I was reading sehe's answer to this question and was surprised that sehe found a handwritten loop using std::memchr to be over 3 times faster than std::count (see the comments). The code using std::count is shown in edit 2, but it basically boils down to:

const auto num_lines = std::count(f, l, '\n');

vs

uintmax_t num_lines = 0;
while (f && f != l)   // f becomes nullptr once memchr finds no more matches
    if ((f = static_cast<const char*>(memchr(f, '\n', l - f))))
        num_lines++, f++;   // count the newline, then resume just past it

I would have expected the std::count version to be at least as fast as the std::memchr one - for much the same reason that std::copy should be at least as fast as std::memcpy.

I checked my standard library's (libc++) implementation of std::count, and there is no attempt to optimise for char input types (ditto for std::find).

Why is this? Couldn't the implementations dispatch to std::memchr when given char* iterators and a char value? Something along these lines, say (my_find is a hypothetical name for illustration, not libc++'s actual code):
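#include <cstring>

// Generic fallback: plain element-by-element search.
template <class It, class T>
It my_find(It first, It last, const T& value) {
    for (; first != last; ++first)
        if (*first == value)
            return first;
    return last;
}

// Overload for the case described above: char* iterators and a char value.
inline const char* my_find(const char* first, const char* last, char value) {
    const void* p = std::memchr(first, static_cast<unsigned char>(value),
                                static_cast<std::size_t>(last - first));
    return p ? static_cast<const char*>(p) : last;
}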

asked Apr 18 '17 by Daniel

1 Answer

Using an actual function call to memchr is only a win if the average distance between matches is not small.

Especially for count, calling memchr could be a lot slower if you were counting 'T' characters that appear on average every 2 or maybe every 4 bytes (e.g. DNA base pairs using the alphabet ACGT).

I'd be skeptical of using a memchr loop as the default implementation for std::count over char arrays. More likely, there are other ways to tweak the source so it compiles to better asm.
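For comparison, the portable loop that std::count boils down to is just a byte-at-a-time count like the sketch below (count_byte is a made-up name for illustration). With dense matches it pays no per-match call overhead, and its trip count is known up front, which is exactly what an auto-vectorizer wants:

#include <cstdint>

std::uintmax_t count_byte(const char* f, const char* l, char c) {
    std::uintmax_t n = 0;
    for (; f != l; ++f)
        n += (*f == c);   // branchless: the comparison result is 0 or 1
    return n;
}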

For find it would make more sense, even though calling memchr can still significantly increase the overhead vs. a simple byte-at-a-time loop if there's a hit within the first couple of bytes.


You could also look at this as a compiler missed-optimization. If compilers made better code for the loops in std::count and std::find, there'd be less to gain from calling hand-written asm library functions.

gcc and clang never auto-vectorize loops whose trip count isn't known before entering the loop, i.e. they don't vectorize search loops, which is a major missed optimization for element sizes as small as bytes. ICC doesn't have this limitation and can vectorize search loops. I haven't looked at how it does with libc++'s std::count or find, though.

std::count has to check every element, so it should auto-vectorize. But if gcc or clang don't manage it even at -O3, that's unfortunate. It should vectorize very well on x86 with pcmpeqb (packed compare bytes), then accumulating the 0/-1 compare results into byte counters with paddb. At least every 255 iterations, psadbw against zero to horizontally sum the byte elements before they can overflow.
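Here's a rough SSE2 intrinsics sketch of that strategy (hand-rolled for illustration, not what any library actually ships). It assumes an x86-64 target, a 16-byte-aligned pointer, and a length that's a multiple of 16; a real version would also handle the unaligned head and scalar tail:

#include <emmintrin.h>
#include <cstdint>
#include <cstddef>

std::uintmax_t count_byte_sse2(const char* p, std::size_t len, char c) {
    const __m128i needle = _mm_set1_epi8(c);
    __m128i totals = _mm_setzero_si128();        // two 64-bit accumulators
    std::size_t i = 0;
    while (i < len) {
        // Inner block: at most 255 iterations so byte counters can't overflow.
        std::size_t end = i + 255 * 16 < len ? i + 255 * 16 : len;
        __m128i counts = _mm_setzero_si128();    // 16 per-lane byte counters
        for (; i < end; i += 16) {
            __m128i v  = _mm_load_si128(reinterpret_cast<const __m128i*>(p + i));
            __m128i eq = _mm_cmpeq_epi8(v, needle);  // pcmpeqb: 0xFF where equal
            counts = _mm_sub_epi8(counts, eq);       // subtracting -1 adds 1
        }
        // psadbw against zero horizontally sums each group of 8 byte counters.
        totals = _mm_add_epi64(totals, _mm_sad_epu8(counts, _mm_setzero_si128()));
    }
    return static_cast<std::uintmax_t>(_mm_cvtsi128_si64(totals))
         + static_cast<std::uintmax_t>(_mm_cvtsi128_si64(_mm_srli_si128(totals, 8)));
}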

Library function call overhead is at least an indirect call through a function pointer loaded from memory (which can cache-miss). On Linux with dynamic linking there's usually an extra jmp through the PLT as well (unless you compiled with -fno-plt). memchr is easier to implement with low startup overhead than strchr, because the length is known up front: you can quickly check whether a 16B vector load could read past the end (vs. having to align the pointer for strchr or strlen to avoid crossing a page or cache-line boundary).

If calling memchr is the best way to implement something in asm, then in theory that's what the compiler should emit. gcc/clang already optimize large copy loops to calls to libc memcpy, depending on target options (-march=). e.g. when the copy is large enough that the libc version may decide to use NT stores on x86.
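A tiny example of that idiom recognition (exact behavior depends on the compiler and flags, but gcc and clang at -O2/-O3 typically turn this loop into a call to memcpy, or memmove if they can't prove the ranges don't overlap):

#include <cstddef>

void copy_bytes(char* dst, const char* src, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = src[i];   // recognized as a memcpy/memmove pattern
}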

answered Sep 28 '22 by Peter Cordes