Does C have an equivalent of std::less from C++?

Q: Does C have an STL?

C can't have an "exact equivalent" of STL because C doesn't have templates or classes.

Q: What type is std::less?

std::less is a function object class template declared in the <functional> header and used for performing comparisons. Its call operator performs a less-than comparison and returns a bool.

Tags: c, pointers, memory-model, undefined-behavior, memory-segmentation

I was recently answering a question on the undefined behaviour of doing p < q in C when p and q are pointers into different objects/arrays. That got me thinking: C++ has the same (undefined) behaviour of < in this case, but also offers the standard library template std::less which is guaranteed to return the same thing as < when the pointers can be compared, and return some consistent ordering when they cannot.

Does C offer something with similar functionality which would allow safely comparing arbitrary pointers (to the same type)? I tried looking through the C11 standard and didn't find anything, but my experience in C is orders of magnitude smaller than in C++, so I could have easily missed something.

asked Oct 10 '19 11:10

Angew is no longer proud of SO

4 Answers

On implementations with a flat memory model (basically everything), casting to uintptr_t will Just Work.
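As a rough sketch of what "Just Work" means here, the comparison could look like the following. The names `flat_ptr_cmp` and `flat_ptr_cmp_demo` are made up for illustration, and the whole thing assumes a flat memory model where converting to `uintptr_t` preserves address order, which ISO C does not promise.

```c
#include <assert.h>
#include <stdint.h>

/* Three-way comparison of two object pointers via their integer values.
   Assumption: flat memory model, so the conversion to uintptr_t
   preserves address order. ISO C itself does not guarantee that. */
int flat_ptr_cmp(const void *a, const void *b)
{
    uintptr_t ua = (uintptr_t)a;
    uintptr_t ub = (uintptr_t)b;
    return (ua > ub) - (ua < ub);   /* -1, 0 or 1 */
}

/* Small self-check (hypothetical helper, not part of any library). */
void flat_ptr_cmp_demo(void)
{
    int arr[4];
    int other;

    /* Inside one array the result has to agree with the < operator. */
    assert(flat_ptr_cmp(&arr[0], &arr[3]) < 0);
    assert(flat_ptr_cmp(&arr[3], &arr[0]) > 0);
    assert(flat_ptr_cmp(&other, &other) == 0);

    /* For unrelated objects we only expect a consistent, nonzero answer. */
    assert(flat_ptr_cmp(&arr[0], &other) != 0);
    assert(flat_ptr_cmp(&arr[0], &other) == -flat_ptr_cmp(&other, &arr[0]));
}
```

The casts go through `const void *` so any object pointer type can be passed without an explicit conversion at the call site.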

(But see "Should pointer comparisons be signed or unsigned in 64-bit x86?" for discussion of whether you should treat pointers as signed or not, including issues of forming pointers outside of objects which is UB in C.)

But systems with non-flat memory models do exist, and thinking about them can help explain the current situation, like C++ having different specs for < vs. std::less.

Part of the point of < on pointers to separate objects being UB in C (or at least unspecified in some C++ revisions) is to allow for weird machines, including non-flat memory models.

A well-known example is x86-16 real mode where pointers are segment:offset, forming a 20-bit linear address via (segment << 4) + offset. The same linear address can be represented by multiple different seg:off combinations.
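To make the aliasing concrete, here is a small model of that address calculation. `struct far_ptr`, `far_linear` and `far_linear_demo` are hypothetical names used only for this illustration; no real compiler's far-pointer type is being described.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of an x86-16 real-mode segment:offset pointer. */
struct far_ptr {
    uint16_t seg;
    uint16_t off;
};

/* Linear address as described above: (segment << 4) + offset.
   The sum can slightly exceed 20 bits; wrap-around is ignored here. */
uint32_t far_linear(struct far_ptr p)
{
    return ((uint32_t)p.seg << 4) + p.off;
}

void far_linear_demo(void)
{
    struct far_ptr a = { 0x1234, 0x0010 };
    struct far_ptr b = { 0x1235, 0x0000 };

    /* Different seg:off bit patterns, same byte in memory. */
    assert(far_linear(a) == 0x12350);
    assert(far_linear(b) == 0x12350);
}
```

Comparing the raw `seg` and `off` fields of `a` and `b` would report them as different even though they name the same byte.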

C++ std::less on pointers on weird ISAs might need to be expensive, e.g. "normalize" a segment:offset on x86-16 to have offset <= 15. However, there's no portable way to implement this. The manipulation required to normalize a uintptr_t (or the object-representation of a pointer object) is implementation-specific.

But even on systems where C++ std::less has to be expensive, < doesn't have to be. For example, assuming a "large" memory model where an object fits within one segment, < can just compare the offset part and not even bother with the segment part. (Pointers inside the same object will have the same segment, and otherwise it's UB in C. C++17 changed to merely "unspecified", which might still allow skipping normalization and just comparing offsets.) This is assuming all pointers to any part of an object always use the same seg value, never normalizing. This is what you'd expect an ABI to require for a "large" as opposed to "huge" memory model. (See discussion in comments).

(Such a memory model might have a max object size of 64kiB for example, but a much larger max total address space that has room for many such max-sized objects. ISO C allows implementations to have a limit on object size that's lower than the max value (unsigned) size_t can represent, SIZE_MAX. For example even on flat memory model systems, GNU C limits max object size to PTRDIFF_MAX so size calculation can ignore signed overflow.) See this answer and discussion in comments.

If you want to allow objects larger than a segment, you need a "huge" memory model that has to worry about overflowing the offset part of a pointer when doing p++ to loop through an array, or when doing indexing / pointer arithmetic. This leads to slower code everywhere, but would probably mean that p < q would happen to work for pointers to different objects, because an implementation targeting a "huge" memory model would normally choose to keep all pointers normalized all the time. See "What are near, far and huge pointers?"; some real C compilers for x86 real mode did have an option to compile for the "huge" model where all pointers defaulted to "huge" unless declared otherwise.
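A sketch of what always-normalized "huge" pointer arithmetic might involve is below. It is a toy model under the assumption that the implementation keeps every pointer normalized with offset <= 15; `struct huge_ptr`, `huge_add` and `huge_less` are invented names, not any real compiler's runtime helpers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical "huge" pointer, kept normalized so that off <= 15. */
struct huge_ptr {
    uint16_t seg;
    uint16_t off;
};

/* Pointer arithmetic goes through the linear address and re-normalizes,
   which is the extra cost paid on every p++ or p + n.
   Wrap-around past the top of the address space is ignored here. */
struct huge_ptr huge_add(struct huge_ptr p, uint32_t n)
{
    uint32_t linear = ((uint32_t)p.seg << 4) + p.off + n;
    struct huge_ptr r;
    r.seg = (uint16_t)(linear >> 4);
    r.off = (uint16_t)(linear & 0xFu);
    return r;
}

/* With both operands normalized, comparing (seg, off) lexicographically
   orders pointers by linear address, even across different objects. */
int huge_less(struct huge_ptr a, struct huge_ptr b)
{
    return a.seg < b.seg || (a.seg == b.seg && a.off < b.off);
}

void huge_ptr_demo(void)
{
    struct huge_ptr p = { 0x1000, 0x000F };
    struct huge_ptr q = huge_add(p, 1);         /* carries into seg */
    struct huge_ptr r = huge_add(p, 0x10000u);  /* steps past 64 KiB */

    assert(q.seg == 0x1001 && q.off == 0x0);
    assert(r.seg == 0x2000 && r.off == 0xF);
    assert(huge_less(p, q) && huge_less(q, r) && !huge_less(r, p));
}
```

The point of the model is the trade-off: every increment does a shift, an add and a mask, but in exchange the comparison is a plain lexicographic compare.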

x86 real-mode segmentation isn't the only non-flat memory model possible, it's merely a useful concrete example to illustrate how it's been handled by C/C++ implementations. In real life, implementations extended ISO C with the concept of far vs. near pointers, allowing programmers to choose when they can get away with just storing / passing around the 16-bit offset part, relative to some common data segment.

But a pure ISO C implementation would have to choose between a small memory model (everything except code in the same 64kiB with 16-bit pointers) or large or huge with all pointers being 32-bit. Some loops could optimize by incrementing just the offset part, but pointer objects couldn't be optimized to be smaller.

If you knew what the magic manipulation was for any given implementation, you could implement it in pure C. The problem is that different systems use different addressing and the details aren't parameterized by any portable macros.

Or maybe not: it might involve looking something up from a special segment table or something, e.g. like x86 protected mode instead of real mode where the segment part of the address is an index, not a value to be left shifted. You could set up partially-overlapping segments in protected mode, and the segment selector parts of addresses wouldn't necessarily even be ordered in the same order as the corresponding segment base addresses. Getting a linear address from a seg:off pointer in x86 protected mode might involve a system call, if the GDT and/or LDT aren't mapped into readable pages in your process.
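To illustrate why a table lookup changes things, here is a toy model of selector-based addressing. Everything in it is hypothetical: `segment_base` stands in for a descriptor table, the selector is treated as a bare index (real x86 selectors also encode a table indicator and privilege level, and real descriptors also carry limits and access rights), and the base values are arbitrary.

```c
#include <assert.h>
#include <stdint.h>

/* Made-up segment bases, indexed by a simplified selector. */
static const uint32_t segment_base[] = {
    0x00400000u,  /* selector 0 */
    0x00100000u,  /* selector 1: lower base despite the higher selector */
    0x00100800u,  /* selector 2: base lies 0x800 bytes above selector 1's */
};

struct sel_ptr {
    uint16_t sel;
    uint32_t off;
};

/* Linear address = base looked up from the table + offset. */
uint32_t sel_linear(struct sel_ptr p)
{
    return segment_base[p.sel] + p.off;
}

/* Ordering by linear address requires the lookup; the selector values
   themselves say nothing about which address is lower. */
int sel_ptr_cmp(struct sel_ptr a, struct sel_ptr b)
{
    uint32_t la = sel_linear(a);
    uint32_t lb = sel_linear(b);
    return (la > lb) - (la < lb);
}

void sel_ptr_demo(void)
{
    struct sel_ptr a = { 0, 0x10 };   /* linear 0x00400010 */
    struct sel_ptr b = { 1, 0x10 };   /* linear 0x00100010 */
    struct sel_ptr c = { 1, 0x900 };  /* linear 0x00100900 */
    struct sel_ptr d = { 2, 0x100 };  /* linear 0x00100900 */

    assert(sel_ptr_cmp(a, b) > 0);    /* smaller selector, higher address */
    assert(sel_ptr_cmp(c, d) == 0);   /* overlapping segments alias */
}
```

In a real protected-mode program the table is not an ordinary array the process can read, which is where the possible system call comes in.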

(Of course mainstream OSes for x86 use a flat memory model so the segment base is always 0 (except for thread-local storage using fs or gs segments), and only the 32-bit or 64-bit "offset" part is used as a pointer.)

You could manually add code for various specific platforms, e.g. by default assume flat, or #ifdef something to detect x86 real mode and split uintptr_t into 16-bit halves for seg += off>>4; off &= 0xf; then combine those parts back into a 32-bit number.
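A minimal sketch of that real-mode branch might look like this. It assumes the pointer converts to a 32-bit integer with the segment in the high 16 bits and the offset in the low 16 bits; that layout, and the names `real_mode_normalize` and `real_mode_ptr_cmp`, are assumptions for illustration and would need checking against the actual implementation. Normalizing carries the upper 12 bits of the offset into the segment, so the linear address `(seg << 4) + off` is unchanged.

```c
#include <assert.h>
#include <stdint.h>

/* Normalize a packed seg:off value so that the offset is <= 15.
   Assumed layout: segment in bits 31..16, offset in bits 15..0. */
uint32_t real_mode_normalize(uint32_t segoff)
{
    uint32_t seg = segoff >> 16;
    uint32_t off = segoff & 0xFFFFu;

    /* Move the upper 12 bits of the offset into the segment. Masking to
       16 bits makes addresses past 1 MiB wrap around (an assumption). */
    seg = (seg + (off >> 4)) & 0xFFFFu;
    off &= 0xFu;

    return (seg << 16) | off;
}

/* Three-way comparison by linear address: -1, 0 or 1. */
int real_mode_ptr_cmp(uint32_t a, uint32_t b)
{
    uint32_t na = real_mode_normalize(a);
    uint32_t nb = real_mode_normalize(b);
    return (na > nb) - (na < nb);
}

void real_mode_demo(void)
{
    /* 1234:0010 and 1235:0000 are the same linear address 0x12350. */
    assert(real_mode_normalize(0x12340010u) == 0x12350000u);
    assert(real_mode_ptr_cmp(0x12340010u, 0x12350000u) == 0);

    /* 1000:FFFF is linear 0x1FFFF, which is below 2000:0000. */
    assert(real_mode_normalize(0x1000FFFFu) == 0x1FFF000Fu);
    assert(real_mode_ptr_cmp(0x1000FFFFu, 0x20000000u) < 0);
}
```

Comparing the raw packed values would order 0x1234:0x0010 before 0x1235:0x0000 even though they alias, which is exactly what the normalization step avoids.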

answered Oct 12 '22 21:10

Peter Cordes

I once tried to find a way around this and I did find a solution that works for overlapping objects and in most other cases assuming the compiler does the "usual" thing.

You can first implement the suggestion in "How to implement memmove in standard C without an intermediate copy?" and then, if that doesn't work, cast to uintptr (a wrapper type for either uintptr_t or unsigned long long depending on whether uintptr_t is available) and get a most-likely accurate result (although it probably wouldn't matter anyway):

#include <stddef.h>
#include <stdint.h>
#ifndef UINTPTR_MAX
typedef unsigned long long uintptr;
#else
typedef uintptr_t uintptr;
#endif

int pcmp(const void *p1, const void *p2, size_t len)
{
    const unsigned char *s1 = p1;
    const unsigned char *s2 = p2;
    size_t l;

    /* Check for overlap */
    for( l = 0; l < len; l++ )
    {
        if( s1 + l == s2 || s1 + l == s2 + len - 1 )
        {
            /* The two objects overlap, so we're allowed to
               use comparison operators. */
            if(s1 > s2)
                return 1;
            else if (s1 < s2)
                return -1;
            else
                return 0;
        }
    }

    /* No overlap so the result probably won't really matter.
       Cast the pointers to `uintptr` and hope the compiler
       does the "usual" thing */
    if((uintptr)s1 > (uintptr)s2)
        return 1;
    else if ((uintptr)s1 < (uintptr)s2)
        return -1;
    else
        return 0;
}

answered Oct 12 '22 22:10

S.S. Anne

Does C offer something with similar functionality which would allow safely comparing arbitrary pointers.

No.

First let us only consider object pointers. Function pointers bring in a whole other set of concerns.

Two pointers p1, p2 can have different encodings and point to the same address, so p1 == p2 even though memcmp(&p1, &p2, sizeof p1) is not 0. Such architectures are rare.

Yet conversion of these pointers to uintptr_t is not required to yield the same integer result, which can lead to (uintptr_t)p1 != (uintptr_t)p2.

(uintptr_t)p1 < (uintptr_t)p2 is itself legal code, but may not provide the hoped-for functionality.

If code truly needs to compare unrelated pointers, write a helper function such as ptrcmp(const void *p1, const void *p2) and put the platform-specific code there.

Perhaps:

#include <stdint.h>

// return -1,0,1 for <,==,> 
int ptrcmp(const void *c1, const void *c2) {
  // Equivalence test works on all platforms
  if (c1 == c2) {
    return 0;
  }
  // At this point, we know pointers are not equivalent.
  #ifdef UINTPTR_MAX
    uintptr_t u1 = (uintptr_t)c1;
    uintptr_t u2 = (uintptr_t)c2;
    // Below code "works" in that the computation is legal,
    //   but does it function as desired?
    // Likely, but strange systems lurk out in the wild. 
    // Check implementation before using
    #if tbd
      return (u1 > u2) - (u1 < u2);
    #else
      #error TBD code
    #endif
  #else
    #error TBD code
  #endif 
}

answered Oct 12 '22 20:10

chux - Reinstate Monica

The C Standard explicitly allows implementations to behave "in a documented manner characteristic of the environment" when an action invokes "Undefined Behavior". When the Standard was written, it would have been obvious to everyone that implementations intended for low-level programming on platforms with a flat memory model should do precisely that when processing relational operators between arbitrary pointers. It also would have been obvious that implementations targeting platforms whose natural means of pointer comparisons would never have side effects should perform comparisons between arbitrary pointers in ways that don't have side effects.

There are three general circumstances where programmers might perform relational operators between pointers:

1. Pointers to unrelated objects will never be compared.
2. Code may compare pointers within an object in cases where the results would matter, or between unrelated objects in cases where the results wouldn't matter. A simple example of this would be an operation that can act upon possibly-overlapping array segments in either ascending or descending order. The choice of ascending or descending order would matter in cases where the objects overlap, but either order would be equally valid when acting upon array segments in unrelated objects.
3. Code relies upon comparisons yielding a transitive ordering consistent with pointer equality.
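The second kind of usage can be sketched as a copy routine that has to pick a direction. The version below deliberately avoids relational operators and uses only pointer equality, which is defined for any two object pointers, so it trades speed for strict conformance. `copy_overlap_safe` is a hypothetical name; a typical implementation would instead compare the pointers with `<` and rely on the platform doing the usual thing.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy n bytes from src to dst, tolerating overlap, using only ==
   on pointers to decide between ascending and descending order. */
void *copy_overlap_safe(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    int dst_inside_src = 0;

    /* Does dst point at one of the bytes src+1 .. src+n-1? */
    for (size_t i = 1; i < n; i++) {
        if (d == s + i) {
            dst_inside_src = 1;
            break;
        }
    }

    if (dst_inside_src) {
        /* dst starts inside the source range: copy descending. */
        for (size_t i = n; i > 0; i--)
            d[i - 1] = s[i - 1];
    } else {
        /* Unrelated objects, dst below src, or dst == src:
           ascending order is safe. */
        for (size_t i = 0; i < n; i++)
            d[i] = s[i];
    }
    return dst;
}

void copy_overlap_demo(void)
{
    char up[8] = "abcdefg";
    char down[8] = "abcdefg";

    copy_overlap_safe(up + 2, up, 5);      /* destination above source */
    assert(strcmp(up, "ababcde") == 0);

    copy_overlap_safe(down, down + 2, 5);  /* destination below source */
    assert(strcmp(down, "cdefgfg") == 0);
}
```

The equality scan costs O(n) before any byte is copied, which is the price of not knowing how unrelated pointers order; with a usable `<` the direction test is a single comparison.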

The third type of usage would seldom occur outside of platform-specific code, which would either know that relational operators would simply work, or would know a platform-specific alternative. The second type of usage could occur in code which should be mostly portable, but almost all implementations could support the second type of usage just as cheaply as the first and there would be no reasons for them to do otherwise. The only people who should have any reason to care about whether the second usage was defined would be people writing compilers for platforms where such comparisons would be expensive or those seeking to ensure that their programs would be compatible with such platforms. Such people would be better placed than the Committee to judge the pros and cons of upholding a "no side effects" guarantee, and thus the Committee leaves the question open.

To be sure, the fact that there would be no reason for a compiler not to process a construct usefully is no guarantee that a "Gratuitously Clever Compiler" won't use the Standard as an excuse to do otherwise, but the reason the C Standard doesn't define a "less" operator is that the Committee expected that "<" would be adequate for almost all programs on almost all platforms.