Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is pointer tagging in C undefined according to the standard?

Some dynamically-typed languages use pointer tagging as a quick way to identify or narrow down the runtime type of the value being represented. A classic way to do this is to convert pointers to a suitably sized integer, and add a tag value over the least significant bits which are assumed to be zero for aligned objects. When the object needs to be accessed, the tag bits are masked away, the integer is converted to a pointer, and the pointer is dereferenced as normal.

This by itself is all in order, except it all hinges on one colossal assumption: that the aligned pointer will convert to an integer guaranteed to have zero bits in the right places.

Is it possible to guarantee this according to the letter of the standard?


Although standard section 6.3.2.3 (references are to the C11 draft) says that the result of a conversion from pointer to integer is implementation-defined, what I'm wondering is whether the pointer arithmetic rules in 6.5.2.1 and 6.5.6 effectively constrain the result of pointer->integer conversion to follow the same predictable arithmetic rules that many programs already assume. (6.3.2.3 note 67 seemingly suggests that this is the intended spirit of the standard anyway, not that that means much.)

I'm specifically thinking of the case where one might allocate a large array to act as a heap for the dynamic language, and therefore the pointers we're talking about are to elements of this array. I'm assuming that the start of the C-allocated array itself can be placed at an aligned position by some secondary means (by all means discuss this too though). Say we have an array of eight-byte "cons cells"; can we guarantee that the pointer to any given cell will convert to an integer with the lowest three bits free for a tag?

For instance:

typedef Cell ...; // such that sizeof(Cell) == 8
Cell heap[1024];  // such that ((uintptr_t)&heap[0]) & 7 == 0

((char *)&heap[11]) - ((char *)&heap[10]); // == 8
(Cell *)(((char *)&heap[10]) + 8);         // == &heap[11]
&(&heap[10])[0];                           // == &heap[10]
0[heap];                                   // == heap[0]

// So...
&((char *)0)[(uintptr_t)&heap[10]];        // == &heap[10] ?
&((char *)0)[(uintptr_t)&heap[10] + 8];    // == &heap[11] ?

// ...implies?
(Cell *)((uintptr_t)&heap[10] + 8);        // == &heap[11] ?

(If I understand correctly, if an implementation provides uintptr_t then the undefined behaviour hinted at in 6.3.2.3 paragraph 6 is irrelevant, right?)

If all of these hold, then I would assume that it means that you can in fact rely on the low bits of any converted pointer to an element of an aligned Cell array to be free for tagging. Do they && does it?

(As far as I'm aware this question is hypothetical since the normal assumption holds for common platforms anyway, and if you found one where it didn't, you probably wouldn't want to look to the C standard for guidance rather than the platform docs; but that's beside the point.)

like image 986
Leushenko Avatar asked Sep 02 '13 17:09

Leushenko


2 Answers

This by itself is all in order, except it all hinges on one colossal assumption: that the aligned pointer will convert to an integer guaranteed to have zero bits in the right places.

Is it possible to guarantee this according to the letter of the standard?

It's possible for an implementation to guarantee this. The result of converting a pointer to an integer is implementation-defined, and an implementation can define it any way it likes, as long as it meets the standard's requirements.

The standard absolutely does not guarantee this in general.

A concrete example: I've worked on a Cray T90 system, which had a C compiler running under a UNIX-like operating system. In the hardware, an address is a 64-bit word containing the address of a 64-bit word; there were no hardware byte addresses. Byte pointers (void*, char*) were implemented in software by storing a 3-bit offset in the otherwise unused high-order 3 bits of a 64-bit word pointer.

All pointer-to-pointer, pointer-to-integer, and integer-to-pointer conversions simply copied the representation.

Which means that a pointer to an 8-byte aligned object, when converted to an integer, could have any bit pattern in its low-order 3 bits.

Nothing in the standard forbids this.

The bottom line: A scheme like the one you describe, that plays games with pointer representations, can work if you make certain assumptions about how the current system represents pointers -- as long as those assumptions happen to be valid for the current system.

But no such assumptions can be 100% reliable, because the standard says nothing about how pointers are represented (other than that they're of a fixed size for each pointer type, and that the representation can be viewed as an array of unsigned char).

(The standard doesn't even guarantee that all pointers are the same size.)

like image 134
Keith Thompson Avatar answered Nov 15 '22 21:11

Keith Thompson


You're right about the relevant parts of the standard. For reference:

An integer may be converted to any pointer type. Except as previously specified, the result is implementation-defined, might not be correctly aligned, might not point to an entity of the referenced type, and might be a trap representation.

Any pointer type may be converted to an integer type. Except as previously specified, the result is implementation-defined. If the result cannot be represented in the integer type, the behavior is undefined. The result need not be in the range of values of any integer type.

Since the conversions are implementation defined (except when the integer type is too small, in which case it's undefined), there's nothing the standard is going to tell you about this behaviour. If your implementation makes the guarantees you want, you're set. Otherwise, too bad.

I guess the answer to your explicit question:

Is it possible to guarantee this according to the letter of the standard?

Is "yes", since the standard punts on this behaviour and says the implementation has to define it. Arguably, "no" is just as good an answer for the same reason.

like image 30
Carl Norum Avatar answered Nov 15 '22 21:11

Carl Norum