I was reading sehe's answer to this question and was surprised to see that a hand-written loop using `std::memchr` was over 3 times faster than `std::count` (see the comments). The code using `std::count` can be seen in edit 2, but it basically boils down to:

```cpp
const auto num_lines = std::count(f, l, '\n');
```
vs.

```cpp
uintmax_t num_lines = 0;
while (f && f != l)
    if ((f = static_cast<const char*>(memchr(f, '\n', l - f))))
        num_lines++, f++;
```
I would have expected the `std::count` version to be at least as fast as the `std::memchr` one, for a similar reason to why `std::copy` should be at least as fast as `std::memcpy`.
I checked my standard library's (libc++) implementation of `std::count`, and there is no attempt to optimise for `char` input types (ditto for `std::find`).
Why is this? Couldn't the implementations dispatch to `std::memchr` when given `char*` iterators and a `char` value?
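For concreteness, the kind of dispatch I have in mind is something like the following sketch (the names are purely illustrative, not libc++ internals): an overload for `char*` iterators that forwards to `memchr`, next to the generic loop.

```cpp
#include <cstddef>
#include <cstring>

// Generic fallback: the plain element-by-element loop.
template <class It, class T>
It find_dispatch(It f, It l, const T& v) {
    for (; f != l; ++f)
        if (*f == v)
            return f;
    return l;
}

// Hypothetical char*/char overload that forwards to memchr.
inline const char* find_dispatch(const char* f, const char* l, char v) {
    const void* p = std::memchr(f, v, static_cast<std::size_t>(l - f));
    return p ? static_cast<const char*>(p) : l;
}
```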
Using an actual function call to `memchr` is only a win if the average distance between matches is not small. Especially for `count`, calling `memchr` could be a lot slower if you were counting `'t'` characters that appear on average every 2 or maybe every 4 bytes (e.g. with DNA base-pairs using an alphabet of ACGT).
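To illustrate the dense-match case, here is a minimal sketch (my own example, not library code) of counting one byte value with a plain loop: there is no per-call overhead, the trip count is known up front, and the branchless body auto-vectorizes well.

```cpp
#include <cstdint>

// Counting a frequent byte (e.g. one DNA base): a plain loop with no
// function-call overhead, which compilers can auto-vectorize because the
// trip count is known before the loop is entered.
uint64_t count_dense(const char* f, const char* l, char c) {
    uint64_t n = 0;
    for (; f != l; ++f)
        n += (*f == c);   // branchless compare-and-accumulate
    return n;
}
```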
I'd be skeptical of using a `memchr` loop as the default implementation for `std::count` over `char` arrays. More likely there are other ways to tweak the source so it compiles to better asm.
For `find` it would make more sense, even though it does potentially increase the overhead significantly vs. a simple byte-at-a-time loop if there's a hit in the first couple of bytes.
You could also look at this as a compiler missed-optimization. If compilers generated better code for the loops in `std::count` and `std::find`, there'd be less to gain from calling hand-written asm library functions.

gcc and clang never auto-vectorize loops when the trip count isn't known before entering the loop (i.e. they don't vectorize search loops, which is a major missed optimization for element sizes as small as bytes). ICC doesn't have this limitation and can vectorize search loops. I haven't looked at how it does with libc++'s `std::count` or `find`, though.
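This is exactly the shape of loop that hits the limitation; a sketch for reference (my own example, not library code):

```cpp
// A search loop: the trip count depends on where the match is, so it isn't
// known before entry, and gcc/clang leave it scalar. ICC can vectorize it.
const char* find_byte(const char* f, const char* l, char c) {
    for (; f != l; ++f)
        if (*f == c)
            return f;
    return l;
}
```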
`std::count` has to check every element, so it should auto-vectorize. But if gcc or clang don't, even at `-O3`, then that's unfortunate. It should vectorize very well on x86 with `pcmpeqb` (packed compare bytes) and then `paddb` of the 0/-1 compare results (with a `psadbw` against zero at least every 255 iterations, to horizontally sum the byte elements before they can overflow).
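A sketch of that strategy with SSE2 intrinsics (assuming x86-64; my own illustration of the `pcmpeqb`/`psadbw` idea, not libc++'s code):

```cpp
#include <emmintrin.h>   // SSE2 intrinsics
#include <cstdint>
#include <cstddef>

uint64_t count_byte_sse2(const char* p, size_t n, char c) {
    const __m128i needle = _mm_set1_epi8(c);
    __m128i total = _mm_setzero_si128();   // two 64-bit accumulators
    size_t i = 0;
    while (n - i >= 16) {
        // Inner loop runs at most 255 iterations so the per-lane byte
        // counters can't overflow before we widen them.
        size_t blocks = (n - i) / 16;
        if (blocks > 255) blocks = 255;
        __m128i acc = _mm_setzero_si128();
        for (size_t j = 0; j < blocks; ++j, i += 16) {
            __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p + i));
            // pcmpeqb: 0xFF per matching byte; subtracting -1 adds 1.
            acc = _mm_sub_epi8(acc, _mm_cmpeq_epi8(v, needle));
        }
        // psadbw against zero horizontally sums the 16 byte counters
        // into the two 64-bit halves of the vector.
        total = _mm_add_epi64(total, _mm_sad_epu8(acc, _mm_setzero_si128()));
    }
    uint64_t count = static_cast<uint64_t>(_mm_cvtsi128_si64(total))
        + static_cast<uint64_t>(_mm_cvtsi128_si64(_mm_unpackhi_epi64(total, total)));
    for (; i < n; ++i)   // scalar tail for the last n % 16 bytes
        count += (p[i] == c);
    return count;
}
```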
Library function call overhead is at least an indirect call with a function pointer from memory (which can cache-miss). On Linux with dynamic linking there's usually an extra `jmp` through the PLT as well (unless you compiled with `-fno-plt`). `memchr` is easier to optimize with low startup overhead than `strchr`, because you can quickly check whether a 16B vector load can go past the end (vs. aligning the pointer for `strchr` or `strlen` to avoid crossing a page or cache-line boundary).
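A sketch of why the known length helps (my own illustration of the idea, not glibc's actual code; `__builtin_ctz` is a gcc/clang builtin):

```cpp
#include <emmintrin.h>   // SSE2
#include <cstddef>

// Because the length is known, an unaligned 16-byte load is safe whenever
// at least 16 bytes remain -- no up-front pointer alignment is needed,
// unlike strchr/strlen, which don't know where the buffer ends.
const char* memchr_sketch(const char* p, char c, size_t len) {
    const __m128i needle = _mm_set1_epi8(c);
    while (len >= 16) {   // the 16-byte load can't go past the end
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p));
        unsigned mask = _mm_movemask_epi8(_mm_cmpeq_epi8(v, needle));
        if (mask)
            return p + __builtin_ctz(mask);   // index of first matching byte
        p += 16;
        len -= 16;
    }
    for (; len; ++p, --len)   // remaining bytes: plain scalar check
        if (*p == c)
            return p;
    return nullptr;
}
```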
If calling `memchr` is the best way to implement something in asm, then in theory that's what the compiler should emit. gcc/clang already optimize large copy loops into calls to libc `memcpy`, depending on target options (`-march=`), e.g. when the copy is large enough that the libc version may decide to use NT stores on x86.
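For example, a loop like the following is typically recognized by gcc and clang and compiled into a call to libc `memcpy` (a minimal illustration of the idiom recognition described above; `__restrict` makes the no-overlap guarantee explicit so the transformation is clearly valid):

```cpp
#include <cstddef>

// gcc/clang recognize this byte-copy idiom and emit a call to memcpy, so
// the libc version's runtime strategy (including NT stores for very large
// copies) applies even though the source is a plain loop.
void copy_bytes(char* __restrict dst, const char* __restrict src, size_t n) {
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i];
}
```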