I'm investigating a minimal-opcode-size x86-64 strlen implementation for code golfing / a binary executable that must not exceed a certain size (think demoscene, for simplicity).
The general idea comes from here, and the size optimization ideas from here and here.
The input string address is in rdi, and the maximum length will not exceed what fits in a 32-bit integer.
xor eax,eax ; 2 bytes
or ecx,-1 ; 3 bytes
repne scasb ; 2 bytes
not ecx ; 2 bytes
dec ecx ; 2 bytes
The final result is in ecx, in 11 bytes total.
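Just for reference, a minimal sketch of the same sequence wrapped as a callable function under the System V AMD64 convention (pointer in rdi, DF clear as the ABI guarantees); the label strlen_small and the trailing mov/ret are illustrative and not counted in the 11 bytes:
strlen_small:
    xor   eax, eax    ; 2 bytes: al = 0, the byte scasb searches for
    or    ecx, -1     ; 3 bytes: ecx = -1, effectively "no limit" for 32-bit lengths
    repne scasb       ; 2 bytes: scan forward from [rdi] until the 0 byte
    not   ecx         ; 2 bytes: ecx = len + 1
    dec   ecx         ; 2 bytes: ecx = len
    mov   eax, ecx    ; return the length in eax
    ret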
The question is about setting ecx to -1.
Option 1 already stated
or ecx,-1 ; 3 bytes
Option 2
lea ecx,[rax-1] ; 3 bytes
Option 3
stc ; 1 byte
sbb ecx,ecx ; 2 bytes
Option 4, probably the slowest one
push -1 ; 2 bytes
pop rcx ; 1 byte
I understand that:
Option 1 has a dependency on the previous ecx value.
Option 2 has a dependency on the previous rax value.
Option 3: I'm not sure if it has a dependency on the previous ecx value?
Option 4 is the slowest one?
Is there a clear winner here?
The criterion is to keep the opcode size as small as possible and to choose the best-performing option.
I'm fully aware there are implementations using modern cpu instructions, but this legacy approach seems the smallest one.
For a hacky good-enough version, we know rdi holds a valid address. It's very likely that edi is not a small integer, hence the 2-byte mov ecx, edi. But this isn't safe if RDI could be pointing just past a 4GiB boundary, so it's hard to prove it's safe, unless you're using an ILP32 ABI like x32 where all pointers are below the 4GiB mark.
So you might need to copy the full RDI with push rdi / pop rcx, 1 byte each. But that adds extra latency to short-string startup. It should be safe if you have no strings longer than their own starting address. (A string longer than its address is only plausible for static storage in .data, .bss, or .rodata if you have some huge arrays; for example, Linux non-PIE executables are loaded at around 0x400000 = 1<<22.)
This is great if you just want rdi pointing just past the terminating 0 byte, instead of actually needing a count. Or if you have the start pointer in another register, you can do sub edi, edx or something and get the length that way (give or take the off-by-one from rdi ending past the terminator) instead of processing the rcx result. (If you know the result fits in 32 bits, you don't need sub rdi, rdx, because you know the upper bits of that would be zero anyway. And high input bits don't affect low output bits for add/sub; carry only propagates from low bits to high.)
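A minimal sketch of that idea, assuming rdx is free to hold a copy of the start pointer (that copy, and leaving the result in edi, are illustrative choices, not part of the original suggestion):
    mov   rdx, rdi    ; 3 bytes: keep the start address (assumed spare register)
    xor   eax, eax    ; 2 bytes: al = 0
    push  rdi         ; 1 byte
    pop   rcx         ; 1 byte: rcx = the pointer itself, used as a "big enough" count
    repne scasb       ; 2 bytes: rdi ends one past the terminating 0 byte
    sub   edi, edx    ; 2 bytes: edi = len + 1 (32-bit math is enough per the note above)
    dec   edi         ; 2 bytes: edi = len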
For strings known to be under 255 bytes, you can use mov cl, -1 (2 bytes). That makes rcx at least 0xFF, and higher depending on what high garbage was left in it. (This has a partial-register stall on Nehalem and earlier when RCX is read, otherwise just a dependency on the old RCX.) Anyway, then mov al, -2 / sub al, cl gives the length as an 8-bit integer. This may or may not be useful.
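A minimal sketch of this under-255-bytes variant; it assumes whatever high garbage is left in rcx only makes the count larger than 0xFF, never smaller:
    xor   eax, eax    ; 2 bytes: al = 0
    mov   cl, -1      ; 2 bytes: rcx >= 0xFF
    repne scasb       ; 2 bytes: cl = 0xFF - (len+1) = -len - 2 (mod 256)
    mov   al, -2      ; 2 bytes
    sub   al, cl      ; 2 bytes: al = -2 - (-len - 2) = len, as an 8-bit value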
Depending on the caller, rcx might already have been holding a pointer value, in which case you could leave it untouched if you can use pointer subtraction instead.
lea ecx,[rax-1] is very good because you just xor-zeroed eax, and it's a cheap single-uop instruction with 1-cycle latency that can run on multiple execution ports on all mainstream CPUs.
When you already have another register with a known constant value, especially one that's xor-zeroed, a 3-byte lea is almost always the most efficient way to create a constant, if it works. (See Set all bits in CPU register to 1 efficiently.)
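As an illustration (the specific constants here are arbitrary), a few values built from an already xor-zeroed rax:
    xor  eax, eax           ; 2 bytes, and breaks any dependency on rax
    lea  ecx, [rax-1]       ; 3 bytes: ecx = -1  (any disp8 in -128..127 works)
    lea  edx, [rax+100]     ; 3 bytes: edx = 100
    lea  esi, [rax+100000]  ; 6 bytes: disp32 needed, still a single cheap uop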
I'm fully aware there are implementations using modern cpu instructions, but this legacy approach seems the smallest one.
Yes, repne scasb is very compact. Its startup overhead is maybe something like 15 cycles on a typical Intel CPU, and according to Agner Fog it issues >= 6n uops with a throughput of >= 2n cycles, where n is the count (i.e. roughly 2 cycles per byte compared, for long strings where the startup overhead is hidden), so it dwarfs the cost of the lea.
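As a rough worked example with those numbers, scanning a 32-byte string costs on the order of 2*32 = 64 cycles in repne scasb (plus whatever part of the ~15-cycle startup isn't hidden), versus a single cycle for the lea that sets up ecx.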
Something with a false dependency on ecx could delay its startup, so you definitely want the lea.
repne scasb is probably fast enough for whatever you're doing, but it's slower than pcmpeqb / pmovmskb / cmp. For short fixed-length strings, integer cmp / jne is very good when the length is 4 or 8 bytes (including the terminating 0), assuming you can safely over-read your strings, i.e. you don't have to worry about a string (even an empty "") sitting right at the end of a page, where the over-read would touch the next page. This method has overhead that scales with string length, though. For a 7-byte string, for example, you could do compares with 4-, 2-, and 1-byte operand sizes, or you could do two dword compares overlapping by 1 byte, like cmp dword [rdi], first_4_bytes / jne; cmp dword [rdi+3], last_4_bytes / jne (sketched below).
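A minimal sketch of the overlapping-dword version for a 7-byte string (6 characters plus the terminator); the literal "foobar" and the label no_match are made up for illustration:
    cmp  dword [rdi], 'foob'        ; bytes 0..3 ('f' in the low byte, little-endian)
    jne  no_match
    cmp  dword [rdi+3], 0x00726162  ; bytes 3..6: 'b','a','r',0, overlapping byte 3
    jne  no_match
    ; falls through here if the string is exactly "foobar"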
More details about LEA
On a Sandybridge-family CPU, the lea can be dispatched to an execution unit in the same cycle that it and the xor-zeroing are issued into the out-of-order core. xor-zeroing is handled in the issue/rename stage, so the uop enters the ROB in an "already executed" state, and it's impossible for the lea to ever have to wait for RAX. (Unless an interrupt happens between the xor and the lea, but even then I think there'd be a serializing instruction after restoring RAX and before the lea could execute, so it couldn't be stuck waiting.)
Simple lea can run on port 0 or port 1 on SnB, or port 1 / port 5 on Skylake (2-per-clock throughput, though sometimes on different ports on different SnB-family CPUs). It has 1-cycle latency, so it's hard to do much better.
It's unlikely you'd see any speedup from using mov ecx, -1 (5 bytes), which can run on any ALU port.
On AMD Ryzen, lea r32, [m] in 64-bit mode is treated as a "slow" LEA that can only run on 2 ports and has 2-cycle latency instead of 1. Worse, Ryzen doesn't eliminate xor-zeroing.
The microbenchmark you did only measures throughput for the versions with no false dependencies, not latency. That is often a useful measure, and you did happen to get the right answer: lea is the best choice. Whether pure throughput accurately reflects anything about your real use case is another matter. You might actually depend on latency rather than throughput, if the string compare is on the critical path as part of a long or loop-carried data dependency chain that isn't broken by a jcc to give you branch prediction + speculative execution. (But branchless code is often larger, so this is unlikely.)
stc / sbb ecx,ecx is interesting, but only AMD CPUs treat sbb as dependency-breaking (depending only on CF, not on the integer register). On Intel Haswell and earlier, sbb is a 2-uop instruction (because it has 3 inputs: 2 GP integer registers + flags). It has 2-cycle latency, which is why it performs so badly here: repeating sbb ecx,ecx makes that latency a loop-carried dependency chain through ecx.
Depending on what you're doing, you may be able to use strlen+2 just as well, by offsetting another constant or something. dec ecx is only 1 byte in 32-bit code, but x86-64 doesn't have the short-form inc/dec encodings (those opcodes were repurposed as REX prefixes), so not / dec isn't as cool in 64-bit code.
After repne scas, you have ecx = -len - 2 (if you started with ecx = -1), and not gives you -x - 1, i.e. +len + 2 - 1 = len + 1, hence the final dec to get len.
; eax = 0
; ecx = -1
repne scasb ; ecx = -len - 2
sub eax, ecx ; eax = +len + 2
I did some benchmarking on an Intel Core i7-4850HQ (Haswell, 2.3 GHz), release build, no debugger attached. In each loop iteration I measure 1000 asm instruction sequences and repeat that 10 million times to average the result.
I made macros that repeat the asm instructions 100 times.
#define lea100   asm{xor eax,eax}; asm{lea ecx,[rax-1]};                  // <== copy-pasted 100 times
#define or100    asm{xor eax,eax}; asm{or ecx,-1};                        // <== copy-pasted 100 times
#define sbb100   asm{xor eax,eax}; asm{stc}; asm{sbb ecx,ecx};            // <== copy-pasted 100 times
#define stack100 asm ("xor %eax,%eax; .byte 0x6A; .byte 0xFF; pop %rcx;"); // 0x6A 0xFF = push -1; <== copy-pasted 100 times
Test C code with inline asm, for macOS:
#include <stdio.h>
#include <CoreServices/CoreServices.h>
#include <mach/mach.h>
#include <mach/mach_time.h>
int main(int argc, const char * argv[]) {
    uint64_t start;
    uint64_t end;
    uint64_t elapsed;
    Nanoseconds elapsedNano;
    uint64_t sum = 0;

    for (int i = 0; i < 10000000; i++) {
        // this will become
        //   call imp___stubs__mach_absolute_time
        //   mov  r14, rax
        start = mach_absolute_time();

        // 10x lea100, for example, for 1000 instruction sequences in total

        // this will become
        //   call imp___stubs__mach_absolute_time
        //   sub  rax, r14
        end = mach_absolute_time();

        elapsed = end - start;
        elapsedNano = AbsoluteToNanoseconds( *(AbsoluteTime *) &elapsed );
        uint64_t nano = *(uint64_t *) &elapsedNano;
        sum += nano;
    }
    printf("%f\n", sum / 10000000.0);
    return 0;
}
Results

xor eax,eax / lea ecx,[rax-1]    : 205-216 ns
xor eax,eax / or ecx,-1          : 321-355 ns
xor eax,eax / push -1 / pop rcx  : 322-359 ns
xor eax,eax / stc / sbb ecx,ecx  : 612-692 ns