According to the C standard:
When two pointers are subtracted, both shall point to elements of the same array object, or one past the last element of the array object (sect. 6.5.6 1173)
[Note: do not assume that I know much about the standard or UB; I just happened to come across this particular rule.]
Now, on the other hand, from experiment it seems that on some architectures (e.g. x86-64), the pointer difference between two arrays gives sensible, reproducible results, and those results seem to correspond reasonably well to the hardware of those architectures. Hence my question: does some implementation actually ensure a specific behavior?
For example, is there an implementation out there in the wild that guarantees, for a and b being char*, that a + (reinterpret_cast<std::ptrdiff_t>(b) - reinterpret_cast<std::ptrdiff_t>(a)) == b?
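To make that concrete, here is a minimal sketch of the kind of experiment meant above (the arrays and names are illustrative, and the standard of course does not guarantee the result):

```cpp
#include <cstddef>
#include <iostream>

int main() {
    char first[16];
    char second[16];
    char* a = first;
    char* b = second;

    // Undefined behavior as far as the standard is concerned (unrelated
    // arrays), but on flat-memory targets such as x86-64 this typically
    // prints 1.
    std::ptrdiff_t delta = reinterpret_cast<std::ptrdiff_t>(b)
                         - reinterpret_cast<std::ptrdiff_t>(a);
    std::cout << (a + delta == b) << '\n';
}
```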
Why make it UB, and not implementation-defined? (where of course, for some architectures, implementation-defined will specify it as UB)
That is not how it works.
If something is documented as "implementation-defined" by the standard, then any conforming implementation is expected to define a behavior for that case, and document it. Leaving it undefined is not an option.
As labeling pointer difference between unrelated arrays "implementation defined" would leave e.g. segmented or Harvard architectures with no way to have a fully-conforming implementation, this case remains undefined by the standard.
Implementations could offer a defined behavior as a non-standard extension. But any program making use of such an extension would no longer be strictly conforming, and non-portable.
Any implementation is free to document a behaviour for which the standard does not require one to be documented; that is well within the limits of the standard. The problem with making this implementation-defined is that every implementation would then have to carefully document its choice, and when C was standardized the committee presumably found that existing implementations varied so wildly that no sensible common ground existed, so they made it undefined behaviour altogether.
I do not know of any compilers that make it defined, but I do know of one that explicitly keeps it undefined, even if you try to cheat with casts:
When casting from pointer to integer and back again, the resulting pointer must reference the same object as the original pointer, otherwise the behavior is undefined. That is, one may not use integer arithmetic to avoid the undefined behavior of pointer arithmetic as proscribed in C99 and C11 6.5.6/8.
I believe another compiler has the same behaviour, though unfortunately it does not document it in an accessible way.
That these two compilers do not define it is a good reason to avoid depending on it in any program, even when compiling with another compiler that does specify a behaviour, because you can never be too sure which compiler you will need to use five years from now...
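To make the quoted rule concrete, here is a sketch (my own, not taken from either compiler's documentation) of the kind of cast-based cheat it rules out:

```cpp
#include <cstdint>

int main() {
    int x = 1;
    int y = 2;

    // The integer arithmetic itself is well-defined...
    std::uintptr_t ux = reinterpret_cast<std::uintptr_t>(&x);
    std::uintptr_t uy = reinterpret_cast<std::uintptr_t>(&y);
    std::uintptr_t delta = uy - ux;   // modular arithmetic, no UB here

    // ...but the round trip back to a pointer is only guaranteed to work
    // when it yields the same object as the original pointer, so using the
    // result as if it referred to y is exactly what the documentation
    // quoted above refuses to define.
    int* py = reinterpret_cast<int*>(ux + delta);
    return *py;
}
```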
The more implementation-defined behavior there is that someone's code depends on, the less portable that code is. In this case, there is already an implementation-defined way out of this: reinterpret_cast the pointers to integers and do your math there. That makes it clear to everyone that you're relying on behavior specific to the implementation (or at least on behavior that may not be portable everywhere).
Plus, while the runtime environment may in fact be "all objects are stored in one big array starting at approximately 0 and ending at approximately the memory size", that is not true of the compile-time behavior. At compile time you can get pointers to objects and do pointer arithmetic on them, but treating such pointers as plain addresses into memory could allow a user to start indexing into compiler data and the like. By making such things UB, they are expressly forbidden at compile time (and reinterpret_cast is explicitly disallowed at compile time).
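A sketch of that way out (the function name is mine): keep the arithmetic in integer types obtained via reinterpret_cast, so the non-portability is visible in the code. Such a function also cannot be constexpr, since reinterpret_cast may not appear in a constant expression, which is the compile-time prohibition mentioned above.

```cpp
#include <cstddef>
#include <cstdint>

// Distance between two addresses, computed on integers rather than on
// pointers. The pointer-to-integer conversion is the implementation-
// specific step, and spelling it out makes the non-portability explicit.
std::ptrdiff_t address_distance(const void* a, const void* b) {
    return static_cast<std::ptrdiff_t>(
        reinterpret_cast<std::uintptr_t>(b) -
        reinterpret_cast<std::uintptr_t>(a));
}
```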
One big reason for saying that things are UB is to allow the compiler to perform optimizations. If you want to allow such a thing, then you remove some optimizations. And as you say, this is only (if even then) useful in some small corner cases. I would say that in most cases where this might seem like a viable option, you should instead reconsider your design.
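As a hypothetical, compiler-agnostic illustration of the kind of optimization this enables: because q - p is only defined when both pointers point into the same array (or one past its end), an optimizer is entitled to assume that precondition holds and fold the whole function below to return true, without ever looking at the actual addresses.

```cpp
#include <cstddef>

bool round_trips(char* p, char* q) {
    // Defined only if p and q point into the same array; under that
    // assumption p + (q - p) == q always holds, so the comparison may be
    // optimized away entirely.
    std::ptrdiff_t d = q - p;
    return p + d == q;
}
```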
From comments below:
I agree, but the problem is that while I can reconsider my design, I can't reconsider the design of other libraries...
It is very rare for the standard to adapt to such things, though it has happened. That's the reason why int *p = 0 is perfectly valid, even though p is a pointer and 0 is an int. This made it into the standard because it was so commonly used instead of the more correct int *p = NULL. But in general, this does not happen, and for good reasons.
First, I feel like we need to get some terms straight, at least with respect to C.
From the C2011 online draft:
Undefined behavior - behavior, upon use of a nonportable or erroneous program construct or of erroneous data, for which this International Standard imposes no requirements. Possible undefined behavior ranges from ignoring the situation completely with unpredictable results, to behaving during translation or program execution in a documented manner characteristic of the environment (with or without the issuance of a diagnostic message), to terminating a translation or execution (with the issuance of a diagnostic message).
Unspecified behavior - use of an unspecified value, or other behavior where this International Standard provides two or more possibilities and imposes no further requirements on which is chosen in any instance. An example of unspecified behavior is the order in which the arguments to a function are evaluated.
Implementation-defined behavior - unspecified behavior where each implementation documents how the choice is made. An example of implementation-defined behavior is the propagation of the high-order bit when a signed integer is shifted right.
The key point above is that unspecified behavior means that the language definition provides multiple values or behaviors from which the implementation may choose, and there are no further requirements on how that choice is made. Unspecified behavior becomes implementation-defined behavior when the implementation documents how it makes that choice.
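As a concrete illustration of the shift example in that definition (my own snippet; the same case was implementation-defined in C++ before C++20, which has since pinned it down to an arithmetic shift):

```cpp
#include <cstdio>

int main() {
    int x = -8;
    // Whether the sign bit is propagated on a signed right shift is left
    // to the implementation: this may print -4 (arithmetic shift) or a
    // large positive value (logical shift); either way the implementation
    // must document which one it does.
    std::printf("%d\n", x >> 1);
}
```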
This means that there are restrictions on what may be considered implementation-defined behavior.
The other key point is that undefined does not mean illegal, it only means unpredictable. It means you've voided the warranty, and anything that happens afterwards is not the responsibility of the compiler implementation. One possible outcome of undefined behavior is to work exactly as expected with no nasty side effects. Which, frankly, is the worst possible outcome, because it means as soon as something in the code or environment changes, everything could blow up and you have no idea why (been in that movie a few times).
Now to the question at hand:
I also know that on some architectures ("segmented machine" as I read somewhere), there are good reasons that the behavior is undefined.
And that's why it's undefined everywhere. There are some architectures still in use where different objects can be stored in different memory segments, and any differences in their addresses would be meaningless. There are just so many different memory models and addressing schemes that you cannot hope to define a behavior that works consistently for all of them (or the definition would be so complicated that it would be difficult to implement).
The philosophy behind C is to be maximally portable to as many architectures as possible, and to do that it imposes as few requirements on the implementation as possible. This is why the standard arithmetic types (int, float, etc.) are defined by the minimum range of values they can represent with a minimum precision, not by the number of bits they take up. It's why pointers to different types may have different sizes and alignments.
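A small sketch of what that looks like in practice; every value printed here is allowed to differ between implementations, as long as the guaranteed minimum ranges are honored:

```cpp
#include <climits>
#include <cstdio>

int main() {
    // Only minimum ranges are guaranteed (int must cover at least
    // -32767..32767); the width and representation are up to the
    // implementation.
    std::printf("CHAR_BIT = %d\n", CHAR_BIT);
    std::printf("sizeof(int) = %zu, INT_MIN = %d, INT_MAX = %d\n",
                sizeof(int), INT_MIN, INT_MAX);
    // Different pointer types need not share a size or alignment either.
    std::printf("sizeof(void*) = %zu, sizeof(int (*)()) = %zu\n",
                sizeof(void*), sizeof(int (*)()));
}
```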
Adding language that would make some behaviors undefined on this list of architectures vs. unspecified on that list of architectures would be a headache, both for the standards committee and for compiler implementors. It would mean adding a lot of special-case logic to compilers like gcc, which could make them less reliable.