According to the C standard:
When two pointers are subtracted, both shall point to elements of the same array object, or one past the last element of the array object (sect. 6.5.6 1173)
[Note: do not assume that I know much about the standard or UB; I just happened to come across this particular rule.]
Now, on the other hand, from experiment it seems that on some architectures (e.g. x86-64), the pointer difference between two arrays gives sensible, reproducible results, and those results seem to correspond reasonably well to the hardware of those architectures. Hence my question: does some implementation actually ensure a specific behavior?
For example, is there an implementation out there in the wild that guarantees, for a and b being char*, that a + (reinterpret_cast<std::ptrdiff_t>(b) - reinterpret_cast<std::ptrdiff_t>(a)) == b?
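To make that concrete, here is a minimal sketch of the kind of experiment meant above (the arrays and names are illustrative, and the standard of course does not guarantee the result):

```cpp
#include <cstddef>
#include <iostream>

int main() {
    char first[16];
    char second[16];
    char* a = first;
    char* b = second;

    // Undefined behavior as far as the standard is concerned (unrelated
    // arrays), but on flat-memory targets such as x86-64 this typically
    // prints 1.
    std::ptrdiff_t delta = reinterpret_cast<std::ptrdiff_t>(b)
                         - reinterpret_cast<std::ptrdiff_t>(a);
    std::cout << (a + delta == b) << '\n';
}
```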
Why make it UB, and not implementation-defined? (where of course, for some architectures, implementation-defined will specify it as UB)
That is not how it works.
If something is documented as "implementation-defined" by the standard, then any conforming implementation is expected to define a behavior for that case, and document it. Leaving it undefined is not an option.
As labeling pointer difference between unrelated arrays "implementation defined" would leave e.g. segmented or Harvard architectures with no way to have a fully-conforming implementation, this case remains undefined by the standard.
Implementations could offer a defined behavior as a non-standard extension. But any program making use of such an extension would no longer be strictly conforming, and non-portable.
Any implementation is free to document a behaviour for which the standard does not require one to be documented; that is well within the limits of the standard. The problem with making this implementation-defined is that every implementation would then have to carefully document its choice, and when C was standardized the committee presumably found that existing implementations varied so wildly that no sensible common ground existed, so they made it undefined behaviour altogether.
I do not know of any compilers that make it defined, but I do know of one that explicitly keeps it undefined, even if you try to cheat with casts:
When casting from pointer to integer and back again, the resulting pointer must reference the same object as the original pointer, otherwise the behavior is undefined. That is, one may not use integer arithmetic to avoid the undefined behavior of pointer arithmetic as proscribed in C99 and C11 6.5.6/8.
I believe another compiler has the same behaviour, though unfortunately it does not document it in an accessible way.
That these two compilers do not define it is a good reason to avoid depending on it in any program, even when compiling with another compiler that does specify a behaviour, because you can never be too sure which compiler you will need to use five years from now...
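To make the quoted rule concrete, here is a sketch (my own, not taken from either compiler's documentation) of the kind of cast-based cheat it rules out:

```cpp
#include <cstdint>

int main() {
    int x = 1;
    int y = 2;

    // The integer arithmetic itself is well-defined...
    std::uintptr_t ux = reinterpret_cast<std::uintptr_t>(&x);
    std::uintptr_t uy = reinterpret_cast<std::uintptr_t>(&y);
    std::uintptr_t delta = uy - ux;   // modular arithmetic, no UB here

    // ...but the round trip back to a pointer is only guaranteed to work
    // when it yields the same object as the original pointer, so using the
    // result as if it referred to y is exactly what the documentation
    // quoted above refuses to define.
    int* py = reinterpret_cast<int*>(ux + delta);
    return *py;
}
```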
The more implementation-defined behavior there is that someone's code depends on, the less portable that code is. In this case, there is already an implementation-defined way out of this: reinterpret_cast the pointers to integers and do your math there. That makes it clear to everyone that you're relying on behavior specific to the implementation (or at least on behavior that may not be portable everywhere).
Plus, while the runtime environment may in fact be "all objects are stored in one big array starting at approximately 0 and ending at approximately the memory size", that is not true of the compile-time behavior. At compile time you can get pointers to objects and do pointer arithmetic on them, but treating such pointers as plain addresses into memory could allow a user to start indexing into compiler data and the like. By making such things UB, they are expressly forbidden at compile time (and reinterpret_cast is explicitly disallowed at compile time).
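A sketch of that way out (the function name is mine): keep the arithmetic in integer types obtained via reinterpret_cast, so the non-portability is visible in the code. Such a function also cannot be constexpr, since reinterpret_cast may not appear in a constant expression, which is the compile-time prohibition mentioned above.

```cpp
#include <cstddef>
#include <cstdint>

// Distance between two addresses, computed on integers rather than on
// pointers. The pointer-to-integer conversion is the implementation-
// specific step, and spelling it out makes the non-portability explicit.
std::ptrdiff_t address_distance(const void* a, const void* b) {
    return static_cast<std::ptrdiff_t>(
        reinterpret_cast<std::uintptr_t>(b) -
        reinterpret_cast<std::uintptr_t>(a));
}
```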
One big reason for saying that things are UB is to allow the compiler to perform optimizations. If you want to allow such a thing, then you remove some optimizations. And as you say, this is only (if even then) useful in some small corner cases. I would say that in most cases where this might seem like a viable option, you should instead reconsider your design.
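As a hypothetical, compiler-agnostic illustration of the kind of optimization this enables: because q - p is only defined when both pointers point into the same array (or one past its end), an optimizer is entitled to assume that precondition holds and fold the whole function below to return true, without ever looking at the actual addresses.

```cpp
#include <cstddef>

bool round_trips(char* p, char* q) {
    // Defined only if p and q point into the same array; under that
    // assumption p + (q - p) == q always holds, so the comparison may be
    // optimized away entirely.
    std::ptrdiff_t d = q - p;
    return p + d == q;
}
```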
From comments below:
I agree, but the problem is that while I can reconsider my design, I can't reconsider the design of other libraries...
It is very rare for the standard to adapt to such things, though it has happened. That's the reason why int *p = 0 is perfectly valid, even though p is a pointer and 0 is an int. This made it into the standard because it was so commonly used instead of the more correct int *p = NULL. But in general, this does not happen, and for good reasons.
First, I feel like we need to get some terms straight, at least with respect to C.
From the C2011 online draft:
Undefined behavior - behavior, upon use of a nonportable or erroneous program construct or of erroneous data, for which this International Standard imposes no requirements. Possible undefined behavior ranges from ignoring the situation completely with unpredictable results, to behaving during translation or program execution in a documented manner characteristic of the environment (with or without the issuance of a diagnostic message), to terminating a translation or execution (with the issuance of a diagnostic message).
Unspecified behavior - use of an unspecified value, or other behavior where this International Standard provides two or more possibilities and imposes no further requirements on which is chosen in any instance. An example of unspecified behavior is the order in which the arguments to a function are evaluated.
Implementation-defined behavior - unspecified behavior where each implementation documents how the choice is made. An example of implementation-defined behavior is the propagation of the high-order bit when a signed integer is shifted right.
The key point above is that unspecified behavior means that the language definition provides multiple values or behaviors from which the implementation may choose, and there are no further requirements on how that choice is made. Unspecified behavior becomes implementation-defined behavior when the implementation documents how it makes that choice.
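As a concrete illustration of the shift example in that definition (my own snippet; the same case was implementation-defined in C++ before C++20, which has since pinned it down to an arithmetic shift):

```cpp
#include <cstdio>

int main() {
    int x = -8;
    // Whether the sign bit is propagated on a signed right shift is left
    // to the implementation: this may print -4 (arithmetic shift) or a
    // large positive value (logical shift); either way the implementation
    // must document which one it does.
    std::printf("%d\n", x >> 1);
}
```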
This means that there are restrictions on what may be considered implementation-defined behavior.
The other key point is that undefined does not mean illegal, it only means unpredictable. It means you've voided the warranty, and anything that happens afterwards is not the responsibility of the compiler implementation. One possible outcome of undefined behavior is to work exactly as expected with no nasty side effects. Which, frankly, is the worst possible outcome, because it means as soon as something in the code or environment changes, everything could blow up and you have no idea why (been in that movie a few times).
Now to the question at hand:
I also know that on some architectures ("segmented machine" as I read somewhere), there are good reasons that the behavior is undefined.
And that's why it's undefined everywhere. There are some architectures still in use where different objects can be stored in different memory segments, and any differences in their addresses would be meaningless. There are just so many different memory models and addressing schemes that you cannot hope to define a behavior that works consistently for all of them (or the definition would be so complicated that it would be difficult to implement).
The philosophy behind C is to be maximally portable to as many architectures as possible, and to do that it imposes as few requirements on the implementation as possible. This is why the standard arithmetic types (int, float, etc.) are defined by the minimum range of values they can represent with a minimum precision, not by the number of bits they take up. It's why pointers to different types may have different sizes and alignments.
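A small sketch of what that looks like in practice; every value printed here is allowed to differ between implementations, as long as the guaranteed minimum ranges are honored:

```cpp
#include <climits>
#include <cstdio>

int main() {
    // Only minimum ranges are guaranteed (int must cover at least
    // -32767..32767); the width and representation are up to the
    // implementation.
    std::printf("CHAR_BIT = %d\n", CHAR_BIT);
    std::printf("sizeof(int) = %zu, INT_MIN = %d, INT_MAX = %d\n",
                sizeof(int), INT_MIN, INT_MAX);
    // Different pointer types need not share a size or alignment either.
    std::printf("sizeof(void*) = %zu, sizeof(int (*)()) = %zu\n",
                sizeof(void*), sizeof(int (*)()));
}
```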
Adding language that would make some behaviors undefined on this list of architectures vs. unspecified on that list of architectures would be a headache, both for the standards committee and for compiler implementors. It would mean adding a lot of special-case logic to compilers like gcc, which could make them less reliable.