I am looking at the performance of memchr
-like functions and made an interesting observation.
This is check.c
with 3 implementations to find the offset of a \n
character in a string:
#include <stdlib.h>
size_t mem1(const char *s)
{
const char *p = s;
while (1)
{
const char x = *p;
if (x == '\n') return (p - s);
p++;
}
}
size_t mem2(const char *s)
{
const char *p = s;
while (1)
{
const char x = *p;
if (x <= '$' && (x == '\n' || x == '\0')) return (p - s);
p++;
}
}
size_t mem3(const char *s)
{
const char *p = s;
while (1)
{
const char x = *p;
if (x == '\n' || x == '\0') return (p - s);
p++;
}
}
size_t mem4(const char *s)
{
const char *p = s;
while (1)
{
const char x = *p;
if (x <= '$' && (x == '\n')) return (p - s);
p++;
}
}
I run these functions on a string of bytes which can be described by the Haskell expression (concat $ replicate 10000 "abcd") ++ "\n" ++ "hello"
- that is 10000 times asdf
, then the newline to find, and then hello
. Of course all 3 implementations return the same offset: 40000 as expected.
Interestingly, when compiled with gcc -O2
, the run times on that string are:
mem1
: 16 usmem2
: 12 usmem3
: 25 usmem4
: 16 us(I'm using the criterion library to measure these times with statistical accuracy.)
I cannot explain this to myself. Why is mem2
so much faster than the other two?
--
The assembly as generated by gcc -S -O2 -o check.asm check.c
:
mem1:
.LFB14:
cmpb $10, (%rdi)
movq %rdi, %rax
je .L9
.L6:
addq $1, %rax
cmpb $10, (%rax)
jne .L6
subq %rdi, %rax
ret
.L9:
xorl %eax, %eax
ret
mem2:
.LFB15:
movq %rdi, %rax
jmp .L13
.L19:
cmpb $10, %dl
je .L14
.L11:
addq $1, %rax
.L13:
movzbl (%rax), %edx
cmpb $36, %dl
jg .L11
testb %dl, %dl
jne .L19
.L14:
subq %rdi, %rax
ret
mem3:
.LFB16:
movzbl (%rdi), %edx
testb %dl, %dl
je .L26
cmpb $10, %dl
movq %rdi, %rax
jne .L27
jmp .L26
.L30:
cmpb $10, %dl
je .L23
.L27:
addq $1, %rax
movzbl (%rax), %edx
testb %dl, %dl
jne .L30
.L23:
subq %rdi, %rax
ret
.L26:
xorl %eax, %eax
ret
mem4:
.LFB17:
cmpb $10, (%rdi)
movq %rdi, %rax
je .L38
.L36:
addq $1, %rax
cmpb $10, (%rax)
jne .L36
subq %rdi, %rax
ret
.L38:
xorl %eax, %eax
ret
Any explanation is very appreciated!
Faster for() loops If you don't care about the order of the loop counter, you can do this instead: for( i=10; i--; ) { ... } Using this code, i loops through the values 9,8,7,6,5,4,3,2,1,0, and the loop should be faster. This works because it is quicker to process "i--" as the test condition, which says "is i non-zero?
As we can see from the above examples separate loops are faster in addition than in combined loops.
In Python, when passed to pseudocode, the print statement is a call to a C (interpreter) function, likewise, C++ makes the inner loop faster.
My best guess is it's to do with register dependency - if you look at the 3-instruction main loop in mem1
, you have a circular dependency on rax
. Naïvely, this means each instruction has to wait for the last one to finish - in practice it means if the instructions aren't retired quickly enough the microarchitecture may run out of registers to rename and just give up and stall for a bit.
In mem2
the fact that there are 4 instructions in the loop - and possibly also the fact that there's more of an explicit pipeline in the use of both rax
and edx/dl
- is probably giving the out-of-order execution hardware an easier time thus it ends up pipelining more efficiently.
I don't claim to be an expert so this may be complete nonsense, but based on what I've studied of Agner Fog's absolute goldmine of Intel optimisation details it doesn't seem an entirely unreasonable hypothesis.
Edit: Out of interest, I've tested mem1
and mem2
on my machine (Core 2 Duo E7500), compiled with -O2 -falign-functions=64 to the exact same assembly code. Calling either function with the given string 1,000,000 times in a loop and using Linux's time
, I get ~19s for mem1
and ~18.8s for mem2
- much less than the 25% difference on the newer microarchitecture. Guess it's time to buy an i5...
Your input is such that makes mem2
faster. Every letter in the input apart from '\n' has value larger than '$', so if
condition is false from the first part of the expression (x <= '$'), and second part of the expression (x == '\n' || x == '\0') is never executed. If you would use "####" instead of "abcd" I suspect the execution would become slower.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With