While studying OSX 10.9.4's implementation of strlen, I notice that it always compares a chunk of 16-bytes and skips ahead to the following 16-bytes until it encounters a '\0
'. The relevant part:
3de0: 48 83 c7 10 add $0x10,%rdi
3de4: 66 0f ef c0 pxor %xmm0,%xmm0
3de8: 66 0f 74 07 pcmpeqb (%rdi),%xmm0
3dec: 66 0f d7 f0 pmovmskb %xmm0,%esi
3df0: 85 f6 test %esi,%esi
3df2: 74 ec je 3de0 <__platform_strlen+0x40>
0x10
is 16 bytes in hex.
When I saw that, I was wondering: this memory could just as well not be allocated. If I had allocated a C string of 20 bytes and passed it to strlen
, it would read 36 bytes of memory. Why is it allowed to do that? I started looking and found How dangerous is it to access an array out of bounds?
Which confirmed that it's definitely not always a good thing, unallocated memory might be unmapped, for example. Yet, there must be something that makes this work. Some of my hypotheses:
What's the actual reason?
EDIT: just found Why I'm getting read and write permission on unallocated memory?, which seems to indicate my first guess was right.
EDIT 2: Stupidly enough, I had forgotten that even though Apple seems to have removed the source of most of its asm implementations (Where did OSX's x86-64 assembly libc routines go?), it left strlen: http://www.opensource.apple.com/source/Libc/Libc-997.90.3/x86_64/string/strlen.s
In the comments we find:
// returns the length of the string s (i.e. the distance in bytes from
// s to the first NUL byte following s). We look for NUL bytes using
// pcmpeqb on 16-byte aligned blocks. Although this may read past the
// end of the string, because all access is aligned, it will never
// read past the end of the string across a page boundary, or even
// accross a cacheline.
EDIT: I honestly think all answerers deserved an accepted answer, and basically all contained the information necessary to understand the issue. So I went for the answer of the person that had the least reputation.
I'm the author of the routine in question.
As some others have said, the key thing is that the reads are all aligned. While reading outside the bounds of an array is undefined behavior in C, we're not writing C; we know lots of details of the x86 architecture beyond what the C abstract machine defines.
In particular, reads beyond the end of a buffer are safe (meaning they cannot produce a trap or other observable side effect) so long as they do not cross a page boundary (because memory attributes and mappings are tracked at page granularity). Since the smallest supported page size is 4096 bytes, an aligned 16 byte load cannot cross a page boundary.
Reading memory on most architectures only has a side effect if the address being read corresponds to a page that is not mapped. Most strlen
implementations for modern computers try to do only aligned reads of however-many bytes. They will never do a 16-byte read straddling two pages, and so they will never elicit any side effect. So it's cool.
How malloc
aligns stuff is irrelevant, since the programmer may allocate a string inside a block. A simple example is a struct
that has an embedded char array:
struct Foo
{
int bar;
char baz[10];
};
If you allocate an instance of this struct
, it will take up 16 bytes, but baz
will start at offset 4. Thus if you read 16 bytes from there, you will cross into the next 16 byte chunk that you don't own. If you are unlucky, that may lie in the next page and trigger a fault.
Also, strings don't have to be in the heap at all, such as constants that are in read-only data section. strlen
must work in all cases.
I assume the strlen
function first processes the initial portion of the string until it's 16 byte aligned (this code has been omitted from the question) and then proceeds in 16 byte chunks. As such, the actual reason this works is reason #2: you won't cross a page boundary which is the granularity of access checking by the processor.
Allocating Small Memory Blocks Using Malloc
...
When allocating any small blocks of memory, remember that the granularity for blocks allocated by the malloc library is 16 bytes. Thus, the smallest block of memory you can allocate is 16 bytes and any blocks larger than that are a multiple of 16. For example, if you call malloc and ask for 4 bytes, it returns a block whose size is 16 bytes; if you request 24 bytes, it returns a block whose size is 32 bytes. Because of this granularity, you should design your data structures carefully and try to make them multiples of 16 bytes whenever possible.
For the record, referencing (reading) past allocation could trigger a page fault if you cross a page boundary. This is how guardmalloc
works:
Each malloc allocation is placed on its own virtual memory page (or pages). By default, the returned address for the allocation is positioned such that the end of the allocated buffer is at the end of the last page, and the next page after that is kept unallocated. Thus, accesses beyond the end of the buffer cause a bad access error immediately.
Also read the vectorized instructions explicit reference in the same page:
As of Mac OS X 10.5, libgmalloc aligns the start of allocated buffers on 16-byte boundaries by default, to allow proper use of vector instructions (e.g., SSE). (The use of vector instructions is common, including in some Mac OS X system libraries.
ps. afaik NSObject and friends share the heap implementation with malloc
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With