Many methods found in high-performance algorithms could be (and are) simplified if they were allowed to read a small amount past the end of input buffers. Here, "small amount" generally means up to W - 1
bytes past the end, where W
is the word size in bytes of the algorithm (e.g., up to 7 bytes for an algorithm processing the input in 64-bit chunks).
It's clear that writing past the end of an input buffer is never safe, in general, since you may clobber data beyond the buffer1. It is also clear that reading past the end of a buffer into another page may trigger a segmentation fault/access violation, since the next page may not be readable.
In the special case of reading aligned values, however, a page fault seems impossible, at least on x86. On that platform, pages (and hence memory protection flags) have a 4K granularity (larger pages, e.g. 2MiB or 1GiB, are possible, but these are multiples of 4K) and so aligned reads will only access bytes in the same page as the valid part of the buffer.
Here's a canonical example of some loop that aligns its input and reads up to 7 bytes past the end of buffer:
int processBytes(uint8_t *input, size_t size) {
uint64_t *input64 = (uint64_t *)input, end64 = (uint64_t *)(input + size);
int res;
if (size < 8) {
// special case for short inputs that we aren't concerned with here
return shortMethod();
}
// check the first 8 bytes
if ((res = match(*input)) >= 0) {
return input + res;
}
// align pointer to the next 8-byte boundary
input64 = (ptrdiff_t)(input64 + 1) & ~0x7;
for (; input64 < end64; input64++) {
if ((res = match(*input64)) > 0) {
return input + res < input + size ? input + res : -1;
}
}
return -1;
}
The inner function int match(uint64_t bytes)
isn't shown, but it is something that looks for a byte matching a certain pattern, and returns the lowest such position (0-7) if found or -1 otherwise.
First, cases with size < 8 are pawned off to another function for simplicity of exposition. Then a single check is done for the first 8 (unaligned bytes). Then a loop is done for the remaining floor((size - 7) / 8)
chunks of 8 bytes2. This loop may read up to 7 bytes past the end of the buffer (the 7 byte case occurs when input & 0xF == 1
). However, return call has a check which excludes any spurious matches which occur beyond the end of the buffer.
Practically speaking, is such a function safe on x86 and x86-64?
These types of overreads are common in high performance code. Special tail code to avoid such overreads is also common. Sometimes you see the latter type replacing the former to silence tools like valgrind. Sometimes you see a proposal to do such a replacement, which is rejected on the grounds the idiom is safe and the tool is in error (or simply too conservative)3.
A note for language lawyers:
Reading from a pointer beyond its allocated size is definitely not allowed in the standard. I appreciate language lawyer answers, and even occasionally write them myself, and I'll even be happy when someone digs up the chapter and verse which shows the code above is undefined behavior and hence not safe in the strictest sense (and I'll copy the details here). Ultimately though, that's not what I'm after. As a practical matter, many common idioms involving pointer conversion, structure access though such pointers and so are technically undefined, but are widespread in high quality and high performance code. Often there is no alternative, or the alternative runs at half speed or less.
If you wish, consider a modified version of this question, which is:
After the above code has been compiled to x86/x86-64 assembly, and the user has verified that it is compiled in the expected way (i.e., the compiler hasn't used a provable partially out-of-bounds access to do something really clever, is executing the compiled program safe?
In that respect, this question is both a C question and a x86 assembly question. Most of the code using this trick that I've seen is written in C, and C is still the dominant language for high performance libraries, easily eclipsing lower level stuff like asm, and higher level stuff like <everything else>. At least outside of the hardcore numerical niche where FORTRAN still plays ball. So I'm interested in the C-compiler-and-below view of the question, which is why I didn't formulate it as a pure x86 assembly question.
All that said, while I am only moderately interested in a link to the standard showing this is UD, I am very interested in any details of actual implementations that can use this particular UD to produce unexpected code. Now I don't think this can happen without some deep pretty deep cross-procedure analysis, but the gcc overflow stuff surprised a lot of people too...
1 Even in apparently harmless cases, e.g., where the same value is written back, it can break concurrent code.
2 Note for this overlapping to work requires that this function and match()
function to behave in a specific idempotent way - in particular that the return value supports overlapping checks. So a "find first byte matching pattern" works since all the match()
calls are still in-order. A "count bytes matching pattern" method would not work, however, since some bytes could be double counted. As an aside: some functions such as "return the minimum byte" call would work even without the in-order restriction, but need to examine all bytes.
3 It's worth noting here that for valgrind's Memcheck there is a flag, --partial-loads-ok
which controls whether such reads are in fact reported as an error. The default is yes, means that in general such loads are not treated as immediate errors, but that an effort is made to track the subsequent use of loaded bytes, some of which are valid and some of which are not, with an error being flagged if the out-of-range bytes are used. In cases such as the example above, in which the entire word is accessed in match()
, such analysis will conclude the bytes are accessed, even though the results are ultimately discarded. Valgrind cannot in general determine whether invalid bytes from a partial load are actually used (and detection in general is probably very hard).
If you permit consideration of non-CPU devices, then one example of a potentially unsafe operation is accessing out-of-bounds regions of PCI-mapped memory pages. There's no guarantee that the target device is using the same page size or alignment as the main memory subsystem. Attempting to access, for example, address [cpu page base]+0x800
may trigger a device page fault if the device is in a 2KiB page mode. This will usually cause a system bugcheck.
Yes, it's safe in x86 asm, and existing libc strlen(3)
implementations take advantage of this in hand-written asm. And even glibc's fallback C, but it compiles without LTO so it it can never inline. It's basically using C as a portable assembler to create machine code for one function, not as part of a larger C program with inlining. But that's mostly because it also has potential strict-aliasing UB, see my answer on the linked Q&A. You probably also want a GNU C __attribute__((may_alias))
typedef instead of plain unsigned long
as your wider type, like __m128i
etc. already use.
It's safe because an aligned load will never cross a higher alignment boundary, and memory protection happens with aligned pages, so at least 4k boundaries1Any naturally-aligned load that touches at least 1 valid byte can't fault. It's also safe to just check if you're far enough from the next page boundary to do a 16-byte load, like if (p & 4095 > (4096 - 16)) do_special_case_fallback
. See the section below about that for more detail.
It's also generally safe in C compiled for x86, as far as I know. Reading outside an object is of course Undefined Behaviour in C, but works in C-targeting-x86. I don't think compilers explicitly / on purpose define the behaviour, but in practice it works that way.
I think it's not the kind of UB that aggressive compilers will assume can't happen while optimizing, but confirmation from a compiler-writer on this point would be good, especially for cases where it's easily provable at compile-time that an access goes out of past the end of an object. (See discussion in comments with @RossRidge: a previous version of this answer asserted that it was absolutely safe, but that LLVM blog post doesn't really read that way).
This is required in asm to go faster than 1 byte at a time processing an implicit-length string. In C in theory a compiler could know how to optimize such a loop, but in practice they don't so you have to do hacks like this. Until that changes, I suspect that the compilers people care about will generally avoid breaking code that contains this potential UB.
There's no danger when the overread isn't visible to code that knows how long an object is. A compiler has to make asm that works for the case where there are array elements as far as we actually read. The plausible danger I can see with possible future compilers is: after inlining, a compiler might see the UB and decide that this path of execution must never be taken. Or that the terminating condition must be found before the final not-full-vector and leave that out when fully unrolling.
The data you get is unpredictable garbage, but there won't be any other potential side-effects. As long as the your program isn't affected by the garbage bytes, it's fine. (e.g. use bithacks to find if one of the bytes of a uint64_t
are zero, then a byte loop to find the first zero byte, regardless of what garbage is beyond it.)
Hardware data breakpoints (watchpoints) that trigger on a load from a given address. If there's a variable you're monitoring right after an array, you could get a spurious hit. This might be a minor annoyance to someone debugging a normal program. If your function will be part of a program that uses x86 debug registers D0-D3 and the resulting exceptions for something that could affect correctness, then be careful with this.
Or similarly a code checker like valgrind could complain about reading outside an object.
Under a hypothetical 16 or 32-bit OS could that uses segmentation: A segment limit can use 4k or 1-byte granularity so it's possible to create a segment where the first faulting offset is odd. (Having the base of the segment aligned to a cache line or page is irrelevant except for performance). All mainstream x86 OSes use flat memory models, and x86-64 removes support for segment limits for 64-bit mode.
Memory-mapped I/O registers right after the buffer you wanted to loop over with wide loads, especially the same 64B cache-line. This is extremely unlikely even if you're calling functions like this from a device driver (or a user-space program like an X server that has mapped some MMIO space).
If you're processing a 60-byte buffer and need to avoid reading from a 4-byte MMIO register, you'll know about it and will be using a volatile T*
. This sort of situation doesn't happen for normal code.
strlen
is the canonical example of a loop that processes an implicit-length buffer and thus can't vectorize without reading past the end of a buffer. If you need to avoid reading past the terminating 0
byte, you can only read one byte at a time.
For example, glibc's implementation uses a prologue to handle data up to the first 64B alignment boundary. Then in the main loop (gitweb link to the asm source), it loads a whole 64B cache line using four SSE2 aligned loads. It merges them down to one vector with pminub
(min of unsigned bytes), so the final vector will have a zero element only if any of the four vectors had a zero. After finding that the end of the string was somewhere in that cache line, it re-checks each of the four vectors separately to see where. (Using the typical pcmpeqb
against a vector of all-zero, and pmovmskb
/ bsf
to find the position within the vector.) glibc used to have a couple different strlen strategies to choose from, but the current one is good on all x86-64 CPUs.
Usually loops like this avoid touching any extra cache-lines they don't need to touch, not just pages, for performance reasons, like glibc's strlen.
Loading 64B at a time is of course only safe from a 64B-aligned pointer, since naturally-aligned accesses can't cross cache-line or page-line boundaries.
If you do know the length of a buffer ahead of time, you can avoid reading past the end by handling the bytes beyond the last full aligned vector using an unaligned load that ends at the last byte of the buffer.
(Again, this only works with idempotent algorithms, like memcpy, which don't care if they do overlapping stores into the destination. Modify-in-place algorithms often can't do this, except with something like converting a string to upper-case with SSE2, where it's ok to reprocess data that's already been upcased. Other than the store-forwarding stall if you do an unaligned load that overlaps with your last aligned store.)
So if you are vectorizing over a buffer of known length, it's often best to avoid overread anyway.
Non-faulting overread of an object is the kind of UB that definitely can't hurt if the compiler can't see it at compile time. The resulting asm will work as if the extra bytes were part of some object.
But even if it is visible at compile-time, it generally doesn't hurt with current compilers.
PS: a previous version of this answer claimed that unaligned deref of int *
was also safe in C compiled for x86. That is not true. I was a bit too cavalier 3 years ago when writing that part. You need a typedef with GNU C __attribute__((aligned(1),may_alias))
, or memcpy
, to make that safe. The may_alias
part isn't needed if you only access it via signed/unsigned int*
and/or `char*, i.e. in ways that wouldn't violate the normal C strict-aliasing rules.
The set of things ISO C leaves undefined but that Intel intrinsics requires compilers to define does include creating unaligned pointers (at least with types like __m128i*
), but not dereferencing them directly. Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?
This is useful for the first vector of strlen; after this you can p = (p+16) & -16
to go to the next aligned vector. This will partially overlap if p
was not 16-byte aligned, but doing redundant work is sometimes the most compact way to set up for an efficient loop. Avoiding it might mean looping 1 byte at a time until an alignment boundary, and that's certainly worse.
e.g. check ((p + 15) ^ p) & 0xFFF...F000 == 0
(LEA / XOR / TEST) which tells you that the last byte of a 16-byte load has the same page-address bits as the first byte. Or p+15 <= p|0xFFF
(LEA / OR / CMP with better ILP) checks that the last byte-address of the load is <= the last byte of the page containing the first byte.
Or more simply, p & 4095 > (4096 - 16)
(MOV / AND / CMP), i.e. p & (pgsize-1) < (pgsize - vecwidth)
checks that the offset-within-page is far enough from the end of a page.
You can use 32-bit operand-size to save code size (REX prefixes) for this or any of the other checks because the high bits don't matter. Some compilers don't notice this optimization, so you can cast to unsigned int
instead of uintptr_t
, although to silence warnings about code that isn't 64-bit clean you might need to cast (unsigned)(uintptr_t)p
. Further code-size saving can be had with ((unsigned int)p << 20) > ((4096 - vectorlen) << 20)
(MOV / SHL / CMP), because shl reg, 20
is 3 bytes, vs. and eax, imm32
being 5, or 6 for any other register. (Using EAX will also allow the no-modrm short form for cmp eax, 0xfff
.)
If doing this in GNU C, you probably want typedef unsigned long aliasing_unaligned_ulong __attribute__((aligned(1),may_alias));
to make it safe to do unaligned accesses.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With