
Is it well-defined to hold a misaligned pointer, as long as you don't ever dereference it?

I have some C code that parses packed/unpadded binary data that comes in from the network.

This code was/is working fine under Intel/x86, but when I compiled it under ARM it would often crash.

The culprit, as you might have guessed, was unaligned pointers -- in particular, the parsing code would do questionable things like this:

uint8_t buf[2048];
[... code to read some data into buf...]
int32_t nextWord = *((int32_t *) &buf[5]);  // misaligned access -- can crash under ARM!

... that's obviously not going to fly in ARM-land, so I modified it to look more like this:

uint8_t buf[2048];
[... code to read some data into buf...]
int32_t * pNextWord = (int32_t *) &buf[5];
int32_t nextWord;
memcpy(&nextWord, pNextWord, sizeof(nextWord));  // slower but ARM-safe

My question (from a language-lawyer perspective) is: is my "ARM-fixed" approach well-defined under the C language rules?

My worry is that merely holding a misaligned int32_t pointer might be enough to invoke undefined behavior, even if I never dereference it directly. (If my concern is valid, I think I could fix the problem by changing pNextWord's type from (int32_t *) to (char *), but I'd rather not do that unless it's actually necessary, since it would mean doing some pointer-stride arithmetic by hand.)

Jeremy Friesner asked Jul 06 '18 05:07



2 Answers

To parse a multi-byte integer safely across compilers and platforms, you can extract each byte and assemble the bytes into an integer according to the data's endianness. For example, to read a 4-byte integer from a big-endian buffer:

/* buf is a uint8_t* that may point at any address, aligned or not */
uint32_t b0 = buf[0];
uint32_t b1 = buf[1];
uint32_t b2 = buf[2];
uint32_t b3 = buf[3];
uint32_t val = (b0 << 24) | (b1 << 16) | (b2 << 8) | b3;
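
The same idea can be wrapped in small helpers. This is only a sketch: the names read_be32 and read_le32 are made up for illustration, and it assumes the buffer holds at least four readable bytes at the given address.

```c
#include <stdint.h>

/* Hypothetical helper: read a 32-bit big-endian value from any byte
 * address. Only individual bytes are accessed, so alignment of p does
 * not matter. */
static uint32_t read_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) |
           ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |
            (uint32_t)p[3];
}

/* Hypothetical helper: the little-endian counterpart. */
static uint32_t read_le32(const uint8_t *p)
{
    return  (uint32_t)p[0]        |
           ((uint32_t)p[1] << 8)  |
           ((uint32_t)p[2] << 16) |
           ((uint32_t)p[3] << 24);
}
```

With the question's buffer, the call would look like read_be32(&buf[5]). The result does not depend on the host's byte order, only on the byte order of the data.
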
lee qiaoping answered Sep 26 '22 08:09



No, the new code still has undefined behaviour. C11 6.3.2.3p7:

  7. A pointer to an object type may be converted to a pointer to a different object type. If the resulting pointer is not correctly aligned for the referenced type, the behavior is undefined. [...]

It doesn't say anything about dereferencing the pointer - even the conversion has undefined behaviour.


Indeed, the modified code that you assume is ARM-safe might not even be Intel-safe. Compilers are known to generate code for Intel that can crash on unaligned access. While that is not what happens in the linked case, a clever compiler could take the conversion as proof that the address is correctly aligned and use specialized code for memcpy.
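
One way to sidestep the conversion is to never form an int32_t pointer at all and hand memcpy the byte address directly. A minimal sketch, assuming the values are stored back to back in host byte order; load_i32 is a hypothetical name, and the stride arithmetic the question wanted to avoid is confined to one line:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: fetch the index-th int32_t of a packed run that
 * starts at base. base stays a byte pointer throughout, so no possibly
 * misaligned int32_t pointer is ever created. */
static int32_t load_i32(const uint8_t *base, size_t index)
{
    int32_t value;
    /* The offset is computed in bytes; memcpy accesses the source as
     * characters, so base may have any alignment. */
    memcpy(&value, base + index * sizeof value, sizeof value);
    return value;
}
```

With the question's buffer this would be called as load_i32(&buf[5], 0). Note that the bytes are interpreted in the host's byte order, so data arriving from the network may still need a byte swap.
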


Alignment aside, your first excerpt also suffers from a strict aliasing violation. C11 6.5p7:

  7. An object shall have its stored value accessed only by an lvalue expression that has one of the following types:
    • a type compatible with the effective type of the object,
    • a qualified version of a type compatible with the effective type of the object,
    • a type that is the signed or unsigned type corresponding to the effective type of the object,
    • a type that is the signed or unsigned type corresponding to a qualified version of the effective type of the object,
    • an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or
    • a character type.

The array buf[2048] has a declared type, and each element is a uint8_t (in practice a typedef for unsigned char), so the effective type of each element is that character type. You may therefore access the contents of the array only as characters, not as int32_ts.

I.e., even

int32_t nextWord = *((int32_t *) &buf[_Alignof(int32_t)]); 

has undefined behaviour.
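
If word-sized reads at aligned offsets are really wanted, one option in C (not C++) is to declare the buffer as a union, since reading a union member other than the one last written reinterprets the stored bytes. This is a sketch under that assumption; packet_buf and word_at are made-up names, and it only helps for offsets that are multiples of sizeof(int32_t), not for offset 5.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical buffer type: the same storage viewed either as raw
 * bytes or as int32_t words. The union is aligned for int32_t. */
union packet_buf {
    uint8_t bytes[2048];
    int32_t words[2048 / sizeof(int32_t)];
};

/* Read one word through the union member. No pointer conversion is
 * involved, and the access goes through the union type itself. */
static int32_t word_at(const union packet_buf *pb, size_t word_index)
{
    return pb->words[word_index];
}
```

Incoming data would be read into pb.bytes, and word_at(&pb, 1) would then return the four bytes starting at byte offset 4, interpreted in host byte order.
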
