 

How Dangerous is This Faster `strlen`?

Tags:

c

strlen is a fairly simple function, and it is obviously O(n) to compute. However, I have seen a few approaches that operate on more than one character at a time. See example 5 here or this approach here. The basic way these work is by reinterpret-casting the char const* buffer to a uint32_t const* buffer and then checking four bytes at a time.
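For concreteness, here is a minimal sketch of the kind of code I mean (my own paraphrase of those examples, not a copy of either):

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of the technique in question: reinterpret the buffer as
 * uint32_t and test four bytes per iteration.  The zero-byte test is
 * the classic "haszero" bit trick.  Note the hazards discussed below:
 * the cast may be misaligned, and the final read may run up to three
 * bytes past the terminator. */
static size_t strlen_word_naive(const char *s)
{
    const uint32_t *w = (const uint32_t *)s;   /* possibly unaligned */
    for (;;) {
        uint32_t v = *w;
        /* nonzero iff some byte of v is zero */
        if ((v - 0x01010101u) & ~v & 0x80808080u) {
            const char *p = (const char *)w;
            if (p[0] == '\0') return (size_t)(p - s);
            if (p[1] == '\0') return (size_t)(p - s) + 1;
            if (p[2] == '\0') return (size_t)(p - s) + 2;
            return (size_t)(p - s) + 3;
        }
        ++w;
    }
}
```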

Personally, my gut reaction is that this is a segfault-waiting-to-happen, since I might dereference up to three bytes outside valid memory. However, this solution seems to hang around, and it seems curious to me that something so obviously broken has stood the test of time.

I think this constitutes UB for two reasons:

  1. Potential dereference outside valid memory
  2. Potential dereference of unaligned pointer

(Note that there is not an aliasing issue; one might think the uint32_t access aliases the char data as an incompatible type, so that code after the strlen (such as code that modifies the string) could be reordered relative to it, but it turns out that char is an explicit exception to strict aliasing.)

But how likely is it to fail in practice? At minimum, I think there needs to be 3 bytes of padding after the string-literal data section, malloc needs to return 4-byte-or-larger-aligned pointers (actually the case on most systems), and malloc needs to allocate 3 extra bytes. There are other criteria related to aliasing. This is all fine for compiler implementations, which create their own environments, but how frequently are these conditions met on modern hardware for user code?

asked Jun 25 '15 by imallett


3 Answers

The technique is valid, and you will not avoid it if you call your C library's strlen. If that library is, for instance, a recent version of the GNU C library (at least on certain targets), it does the same thing.

The key to making it work is to ensure that the pointer is properly aligned. If the pointer is aligned, the operation will indeed read beyond the end of the string, but not into an adjacent page: if the null terminator is within one word of the end of a page, that last word is accessed without touching the subsequent page.
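
Here is a minimal sketch of that idea in C (the function name and structure are mine, not glibc's, and as noted below this is still not well-defined C; it only illustrates why the aligned reads cannot cross a page boundary):

```c
#include <stddef.h>
#include <stdint.h>

/* Word-at-a-time strlen with an alignment prologue.  Leading bytes are
 * scanned one at a time until the pointer is word-aligned; after that,
 * every read is an aligned uint32_t, so a read may run past the
 * terminator but can never cross into the next page (page boundaries
 * are always a multiple of the word size). */
static size_t strlen_word_aligned(const char *s)
{
    const char *p = s;

    /* scan byte-by-byte up to the first aligned address */
    while ((uintptr_t)p % sizeof(uint32_t) != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        ++p;
    }

    const uint32_t *w = (const uint32_t *)p;
    for (;;) {
        uint32_t v = *w;
        if ((v - 0x01010101u) & ~v & 0x80808080u)   /* some byte is zero */
            break;
        ++w;
    }

    /* locate the exact zero byte inside the final word */
    p = (const char *)w;
    while (*p != '\0')
        ++p;
    return (size_t)(p - s);
}
```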

It certainly isn't well-defined behavior in C, and so it carries the burden of careful validation when ported from one compiler to another. It also triggers false positives from out-of-bounds access detectors like Valgrind.

Valgrind had to be patched to work around Glibc doing this. Without the patches, you get nuisance errors such as this:

==13669== Invalid read of size 8
==13669==    at 0x411D6D7: __wcslen_sse2 (wcslen-sse2.S:59)
==13669==    by 0x806923F: length_str (lib.c:2410)
==13669==    by 0x807E61A: string_out_put_string (stream.c:997)
==13669==    by 0x8075853: obj_pprint (lib.c:7103)
==13669==    by 0x8084318: vformat (stream.c:2033)
==13669==    by 0x8081599: format (stream.c:2100)
==13669==    by 0x408F4D2: (below main) (libc-start.c:226)
==13669==  Address 0x43bcaf8 is 56 bytes inside a block of size 60 alloc'd
==13669==    at 0x402BE68: malloc (in /usr/lib/valgrind/vgpreload_memcheck-x86-linux.so)
==13669==    by 0x8063C4F: chk_malloc (lib.c:1763)
==13669==    by 0x806CD79: sub_str (lib.c:2653)
==13669==    by 0x804A7E2: sysroot_helper (txr.c:233)
==13669==    by 0x408F4D2: (below main) (libc-start.c:226)

Glibc is using SSE instructions to calculate wcslen eight bytes at a time (rather than four bytes, the width of a single wchar_t). In doing so, it is accessing offset 56 in a block that is 60 bytes wide. However, note that this access could never straddle a page boundary: the address is divisible by 8.

If you're working in assembly language, you don't have to think twice about the technique.

In fact, the technique is used quite a bit in some optimized audio codecs that I work with (targeting ARM), which feature a lot of hand-written assembly language using the Neon instruction set.

I noticed it when running Valgrind on code which integrated these codecs, and contacted the vendor. They explained that it was just a harmless loop optimization technique; I went through the assembly language and convinced myself they were right.

answered by Kaz


(1) can definitely happen. There's nothing preventing you from taking the strlen of a string near the end of an allocated page, which could result in an access past the end of allocated memory and a nice big crash. As you note, this could be mitigated by padding all your allocations, but then every library you use has to do the same. Worse, you have to arrange for the linker and OS to always add this padding (remember that the OS passes argv[] in a static memory buffer somewhere). The overhead of doing this isn't worth it.

(2) also definitely happens. Earlier versions of ARM processors generate data aborts on unaligned accesses, which either cause your program to die with a bus error (or halt the CPU if you're running bare-metal), or force a very expensive trap through the kernel to handle the unaligned access. These earlier ARM chips are still in wide use in older cellphones and embedded devices. Later ARM processors synthesize multiple word accesses to deal with unaligned accesses, but this will result in overall slower performance since you basically double the number of memory loads you need to do.

Many current ("modern") PICs and embedded microprocessors lack the logic to handle unaligned accesses, and may behave unpredictably or even nonsensically when given unaligned addresses (I've personally seen chips that simply mask off the bottom address bits, giving the wrong answer, and others that return outright garbage on unaligned accesses).

So, this is ridiculously dangerous to use in anything that should be remotely portable. Please, please, please do not use this code; use the libc strlen. It will usually be faster (optimized for your platform properly) and will make your code portable. The last thing you want is for your code to subtly and unexpectedly break in some situation (string near the end of an allocation) or on some new processor.

answered by nneonneo



Donald Knuth, who wrote 3+ volumes on clever algorithms, said: "Premature optimization is the root of all evil."

strlen() is used a lot, so it really should be fast. Riffing on wildplasser's remark, "I would trust the library function": what makes you think that the library function works a byte at a time, or that it is slow?

The title may give folks the impression that the code you suggest is faster than the standard system library's strlen(), but I think what you mean is that it is faster than a naive strlen(), which probably doesn't get used anyway.

I compiled a simple C program and looked at the strlen() on my 64-bit system, which uses GNU glibc. The code I saw was pretty sophisticated, and it looks fast because it works a register width at a time rather than a byte at a time. The strlen() I saw is written in assembly language, so there probably aren't junk instructions of the kind you might get from compiled C code. What I saw was rtld-strlen.S. That code also unrolls loops to reduce looping overhead.

Before you think you can do better on strlen, you should look at that code, or the corresponding code for your particular architecture, and register size.

And if you do write your own strlen, benchmark it against the existing implementation.
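For example, a crude harness along these lines (my_strlen is a placeholder for whatever implementation you want to test; here it is just a byte-at-a-time loop so the example compiles on its own):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Placeholder for the custom implementation under test. */
static size_t my_strlen(const char *s)
{
    const char *p = s;
    while (*p) ++p;
    return (size_t)(p - s);
}

int main(void)
{
    enum { LEN = 1 << 20, REPS = 2000 };
    char *buf = malloc(LEN + 1);
    if (!buf) return 1;
    memset(buf, 'x', LEN);
    buf[LEN] = '\0';

    volatile size_t sink = 0;   /* keep the calls from being optimized away */

    clock_t t0 = clock();
    for (int i = 0; i < REPS; i++) sink += strlen(buf);
    clock_t t1 = clock();
    for (int i = 0; i < REPS; i++) sink += my_strlen(buf);
    clock_t t2 = clock();

    printf("libc strlen: %.3fs  custom: %.3fs (sink=%zu)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC,
           (size_t)sink);
    free(buf);
    return 0;
}
```

A serious comparison would also vary string lengths and alignments, since word-at-a-time code behaves very differently on short strings.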

And obviously, if you use the system strlen, then it is probably correct and you don't have to worry about invalid memory references as a result of an optimization in the code.

answered by rocky