Why is my more complicated C loop faster?

Tags:

performance

c

assembly

I am looking at the performance of memchr-like functions and made an interesting observation.

This is check.c with 4 implementations to find the offset of a \n character in a string:

#include <stdlib.h>

size_t mem1(const char *s)
{
  const char *p = s;
  while (1)
  {
    const char x = *p;
    if (x == '\n') return (p - s);
    p++;
  }
}

size_t mem2(const char *s)
{
  const char *p = s;
  while (1)
  {
    const char x = *p;
    if (x <= '$' && (x == '\n' || x == '\0')) return (p - s);
    p++;
  }
}

size_t mem3(const char *s)
{
  const char *p = s;
  while (1)
  {
    const char x = *p;
    if (x == '\n' || x == '\0') return (p - s);
    p++;
  }
}

size_t mem4(const char *s)
{
  const char *p = s;
  while (1)
  {
    const char x = *p;
    if (x <= '$' && (x == '\n')) return (p - s);
    p++;
  }
}

I run these functions on a string of bytes which can be described by the Haskell expression (concat $ replicate 10000 "abcd") ++ "\n" ++ "hello" - that is, abcd repeated 10000 times, then the newline to find, and then hello. Of course all 4 implementations return the same offset: 40000, as expected.
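For readers who want to reproduce the input from C rather than Haskell, here is a minimal sketch of how that buffer could be built. The helper names build_input and newline_offset are hypothetical (they are not part of check.c), and strchr is used only as an independent cross-check of the expected offset.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: builds "abcd" repeated `reps` times, then "\n",
   then "hello". The caller owns the returned buffer and must free() it. */
char *build_input(size_t reps)
{
  static const char unit[] = "abcd";
  static const char tail[] = "\nhello";
  const size_t unit_len = sizeof unit - 1;
  const size_t tail_len = sizeof tail - 1;

  char *buf = malloc(reps * unit_len + tail_len + 1);
  if (buf == NULL) return NULL;

  for (size_t i = 0; i < reps; i++)
    memcpy(buf + i * unit_len, unit, unit_len);

  /* tail_len + 1 also copies the terminating '\0'. */
  memcpy(buf + reps * unit_len, tail, tail_len + 1);
  return buf;
}

/* Hypothetical helper: offset of the first '\n' according to the standard
   library, useful for cross-checking the hand-written loops. */
size_t newline_offset(const char *s)
{
  const char *nl = strchr(s, '\n');
  assert(nl != NULL);
  return (size_t)(nl - s);
}
```

With reps = 10000 the newline sits at offset 4 * 10000 = 40000, which matches the value reported above.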

Interestingly, when compiled with gcc -O2, the run times on that string are:

mem1: 16 us
mem2: 12 us
mem3: 25 us
mem4: 16 us

(I'm using the criterion library to measure these times with statistical accuracy.)

I cannot explain this to myself. Why is mem2 so much faster than the others?

The assembly as generated by gcc -S -O2 -o check.asm check.c:

mem1:
.LFB14:
  cmpb  $10, (%rdi)
  movq  %rdi, %rax
  je  .L9
.L6:
  addq  $1, %rax
  cmpb  $10, (%rax)
  jne .L6
  subq  %rdi, %rax
  ret
.L9:
  xorl  %eax, %eax
  ret


mem2:
.LFB15:
  movq  %rdi, %rax
  jmp .L13
.L19:
  cmpb  $10, %dl
  je  .L14
.L11:
  addq  $1, %rax
.L13:
  movzbl  (%rax), %edx
  cmpb  $36, %dl
  jg  .L11
  testb %dl, %dl
  jne .L19
.L14:
  subq  %rdi, %rax
  ret


mem3:
.LFB16:
  movzbl  (%rdi), %edx
  testb %dl, %dl
  je  .L26
  cmpb  $10, %dl
  movq  %rdi, %rax
  jne .L27
  jmp .L26
.L30:
  cmpb  $10, %dl
  je  .L23
.L27:
  addq  $1, %rax
  movzbl  (%rax), %edx
  testb %dl, %dl
  jne .L30
.L23:
  subq  %rdi, %rax
  ret
.L26:
  xorl  %eax, %eax
  ret


mem4:
.LFB17:
  cmpb  $10, (%rdi)
  movq  %rdi, %rax
  je  .L38
.L36:
  addq  $1, %rax
  cmpb  $10, (%rax)
  jne .L36
  subq  %rdi, %rax
  ret
.L38:
  xorl  %eax, %eax
  ret

Any explanation is very appreciated!

260

asked Jan 12 '14 15:01

nh2

2 Answers

My best guess is it's to do with register dependency - if you look at the 3-instruction main loop in mem1, you have a circular dependency on rax. Naïvely, this means each instruction has to wait for the last one to finish - in practice it means if the instructions aren't retired quickly enough the microarchitecture may run out of registers to rename and just give up and stall for a bit.

In mem2 the fact that there are 4 instructions in the loop - and possibly also the fact that there's more of an explicit pipeline in the use of both rax and edx/dl - is probably giving the out-of-order execution hardware an easier time, so it ends up pipelining more efficiently.

I don't claim to be an expert so this may be complete nonsense, but based on what I've studied of Agner Fog's absolute goldmine of Intel optimisation details it doesn't seem an entirely unreasonable hypothesis.

Edit: Out of interest, I've tested mem1 and mem2 on my machine (Core 2 Duo E7500), compiled with -O2 -falign-functions=64 to the exact same assembly code. Calling either function with the given string 1,000,000 times in a loop and using Linux's time, I get ~19s for mem1 and ~18.8s for mem2 - much less than the 25% difference on the newer microarchitecture. Guess it's time to buy an i5...
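A rough sketch of that kind of repeated-call measurement is below. Note that it is not the setup used above: it times the loop in-process with the standard clock() function instead of Linux's time, and the names time_calls and sink are made up for illustration. mem1 is copied from the question so the sketch compiles on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Copy of mem1 from the question. */
size_t mem1(const char *s)
{
  const char *p = s;
  while (1)
  {
    const char x = *p;
    if (x == '\n') return (p - s);
    p++;
  }
}

/* Each result is stored to a volatile so the calls are not trivially dead
   code. An aggressive optimizer could still hoist a pure call out of the
   loop, so checking the generated assembly is advisable. */
volatile size_t sink;

/* Hypothetical harness: calls fn(s) `iterations` times and returns the
   CPU time consumed, in seconds, as reported by clock(). */
double time_calls(size_t (*fn)(const char *), const char *s, long iterations)
{
  assert(iterations > 0);
  const clock_t start = clock();
  for (long i = 0; i < iterations; i++)
    sink = fn(s);
  const clock_t end = clock();
  return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Something like time_calls(mem1, input, 1000000) would then give a number comparable in spirit to the figures above, although absolute values depend heavily on the machine and on code alignment.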

130

answered Sep 19 '22 15:09

Notlikethat

Your input happens to favor mem2. Every character in the input apart from '\n' has a value larger than '$', so the if condition is already false after the first part of the expression (x <= '$'), and the second part (x == '\n' || x == '\0') is evaluated only once, for the final newline. If you used "####" instead of "abcd", I suspect the execution would become slower.
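To make the short-circuit argument concrete, here is a small sketch that mirrors the condition in mem2 but counts how many bytes get past the x <= '$' filter. The function names count_second_checks and check_examples are hypothetical, and the expected counts assume an ASCII character set ('#' is 35, '$' is 36, 'a' is 97).

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors mem2's condition, but returns how many bytes passed the cheap
   (x <= '$') filter and therefore reached the second comparison. */
size_t count_second_checks(const char *s)
{
  size_t reached = 0;
  for (const char *p = s; ; p++)
  {
    const char x = *p;
    if (x <= '$')
    {
      reached++;
      if (x == '\n' || x == '\0') return reached;
    }
  }
}

/* Small illustration of the two kinds of input (ASCII assumed). */
void check_examples(void)
{
  /* Letters are all above '$': only the newline reaches the second test. */
  assert(count_second_checks("abcdabcd\n") == 1);
  /* Each '#' is below '$': four of them plus the newline reach it. */
  assert(count_second_checks("####abcd\n") == 5);
}
```

This only shows how often the second comparison is reached. Whether that translates into a measurable slowdown for "####" input is the suspicion stated above, not something this sketch measures.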

answered Sep 16 '22 15:09

Dialecticus
