Is it possible to hash pointers in portable C++03 code?

Tags:

c++

pointers

language-lawyer

hash

Is it possible to portably hash a pointer in C++03, which does not have std::hash defined?

It seems really weird for hashables containing pointers to be impossible in C++, but I can't think of any way of making them.

The closest way I can think of is doing reinterpret_cast<uintptr_t>(ptr), but uintptr_t is not required to be defined in C++03, and I'm not sure if the value could be legally manipulated even if it was defined... is this even possible?

278

asked Jan 05 '13 01:01

user541686

1 Answer

No, in general. In fact it's not even possible in general in C++11 without std::hash.

The reason why lies in the difference between values and value representations.

You may recall the very common example used to demonstrate the difference between a value and its representation: the null pointer value. Many people mistakenly assume that the representation for this value is all bits zero. This is not guaranteed in any fashion. You are guaranteed behavior by its value only.

For another example, consider:

int i;
int* x = &i;
int* y = &i;

x == y;  // this is true; the two pointer values are equal

Underneath that, though, the value representation for x and y could be different!

Let's play compiler. We'll implement the value representation for pointers. Let's say we need (for hypothetical architecture reasons) the pointers to be at least two bytes, but only one is used for the value.

I'll just jump ahead and say it could be something like this:

struct __pointer_impl
{
    std::uint8_t byte1; // contains the address we're holding
    std::uint8_t byte2; // needed for architecture reasons, unused
    // (assume no padding; we are the compiler, after all)
};

Okay, this is our value representation; now let's implement the value semantics. First, equality:

bool operator==(const __pointer_impl& first, const __pointer_impl& second)
{
    return first.byte1 == second.byte1;
}

Because the pointer's value is really only contained in the first byte (even though its representation has two bytes), that's all we have to compare. The second bytes are irrelevant, even if they differ.

We need the address-of operator implementation, of course:

__pointer_impl __address_of(int& i)
{
    __pointer_impl result;

    result.byte1 = /* hypothetical architecture magic */;

    return result;
}

This particular implementation overload gets us a pointer value representation for a given int. Note that the second byte is left uninitialized! That's okay: it's not important for the value.

This is really all we need to drive the point home. Pretend the rest of the implementation is done. :)

So now consider our first example again, "compiler-ized":

int i;

/* int* x = &i; */
__pointer_impl x = __address_of(i);

/* int* y = &i; */
__pointer_impl y = __address_of(i);

x == y;  // this is true; the two pointer values are equal

For our tiny example on the hypothetical architecture, this sufficiently provides the guarantees required by the standard for pointer values. But note you are never guaranteed that x == y implies memcmp(&x, &y, sizeof(__pointer_impl)) == 0. There simply aren't requirements on the value representation to do so.
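To make that concrete, here is a small simulation of the hypothetical two-byte pointer. It is only a sketch: `SimPointer`, `sim_address_of`, and `same_representation` are made-up names (the double-underscore names above are reserved for real implementations), and the `junk` parameter is an assumption standing in for whatever an uninitialized `byte2` might happen to hold.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical stand-in for the two-byte pointer representation above.
struct SimPointer
{
    unsigned char byte1; // carries the address (the value)
    unsigned char byte2; // not part of the value; may hold anything
};

// Value equality looks only at the byte that carries the value.
inline bool operator==(const SimPointer& a, const SimPointer& b)
{
    return a.byte1 == b.byte1;
}

// Simulates address-of; `junk` models the leftover contents of byte2.
inline SimPointer sim_address_of(unsigned char address, unsigned char junk)
{
    SimPointer result;
    result.byte1 = address;
    result.byte2 = junk;
    return result;
}

// True when the two object representations are byte-for-byte identical.
inline bool same_representation(const SimPointer& a, const SimPointer& b)
{
    return std::memcmp(&a, &b, sizeof(SimPointer)) == 0;
}

// Equal values, different representations.
inline void demo_equal_but_different_bytes()
{
    SimPointer x = sim_address_of(0x2A, 0x00);
    SimPointer y = sim_address_of(0x2A, 0xFF);
    assert(x == y);                     // same pointer value
    assert(!same_representation(x, y)); // but the bytes differ
}
```

Calling `demo_equal_but_different_bytes()` passes both assertions: the values compare equal while `memcmp` reports a difference.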

Now consider your question: how do we hash pointers? That is, we want to implement:

template <typename T>
struct myhash;

template <typename T>
struct myhash<T*> :
    std::unary_function<T*, std::size_t>
{
    std::size_t operator()(T* const ptr) const
    {
        return /* ??? */;
    }
};

The most important requirement is that if x == y, then myhash<T*>()(x) == myhash<T*>()(y). We also already know how to hash integers. What can we do?

The only thing we can do is try to somehow convert the pointer to an integer. Well, C++11 gives us std::uintptr_t (where the implementation provides it; the type is optional), so we can do this, right?

return myhash<std::uintptr_t>()(reinterpret_cast<std::uintptr_t>(ptr));

Perhaps surprisingly, this is not correct. To understand why, imagine again we're implementing it:

// okay because we assumed no padding:
typedef std::uint16_t __uintptr_t; // will be used for std::uintptr_t implementation

__uintptr_t __to_integer(const __pointer_impl& ptr)
{
    __uintptr_t result;
    std::memcpy(&result, &ptr, sizeof(__uintptr_t));

    return result;
}

__pointer_impl __from_integer(const __uintptr_t& ptrint)
{
    __pointer_impl result;
    std::memcpy(&result, &ptrint, sizeof(__pointer_impl));

    return result;
}

So when we reinterpret_cast a pointer to integer, we'll use __to_integer, and going back we'll use __from_integer. Note that the resulting integer will have a value depending upon the bits in the value representation of pointers. That is, two equal pointer values could end up with different integer representations...and this is allowed!

This is allowed because the result of reinterpret_cast is totally implementation-defined; you're only guaranteed that the opposite reinterpret_cast gives you back the original pointer value.

So there's the first issue: on this implementation, our hash could end up different for equal pointer values.
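The failure can be reproduced with the same kind of simulation. This is a sketch under stated assumptions, not how any real compiler works: `SimPointer`, `sim_to_integer`, `sim_from_integer`, and `naive_pointer_hash` are hypothetical names, the integer conversion is written with shifts instead of `memcpy` so the result does not depend on byte order, and the identity function stands in for "some integer hash".

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical two-byte pointer: byte1 is the value, byte2 is unused.
struct SimPointer
{
    unsigned char byte1;
    unsigned char byte2;
};

inline bool operator==(const SimPointer& a, const SimPointer& b)
{
    return a.byte1 == b.byte1;
}

// Simulated pointer-to-integer cast: every representation bit is copied.
inline unsigned sim_to_integer(const SimPointer& p)
{
    return static_cast<unsigned>(p.byte1)
         | (static_cast<unsigned>(p.byte2) << 8);
}

// Simulated integer-to-pointer cast: the exact inverse of the above.
inline SimPointer sim_from_integer(unsigned value)
{
    SimPointer p;
    p.byte1 = static_cast<unsigned char>(value & 0xFFu);
    p.byte2 = static_cast<unsigned char>((value >> 8) & 0xFFu);
    return p;
}

// Placeholder integer hash (identity), just to keep the example small.
inline std::size_t hash_integer(unsigned value)
{
    return static_cast<std::size_t>(value);
}

// The "convert to integer, then hash" idea from the text.
inline std::size_t naive_pointer_hash(const SimPointer& p)
{
    return hash_integer(sim_to_integer(p));
}

inline void demo_hash_mismatch()
{
    SimPointer x = {0x2A, 0x00};
    SimPointer y = {0x2A, 0xFF};
    assert(x == y);                                           // equal values
    assert(sim_from_integer(sim_to_integer(y)) == y);         // round trip holds
    assert(naive_pointer_hash(x) != naive_pointer_hash(y));   // hashes differ
}
```

The round trip works, which is all that is promised, yet two equal pointer values hash differently.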

This idea is out. Maybe we can reach into the representation itself and hash the bytes together. But this obviously ends up with the same issue, which is what the comments on your question are alluding to. Those pesky unused representation bits are always in the way, and there's no way to figure out where they are so we can ignore them.

We're stuck! It's just not possible. In general.

Remember, in practice we compile for certain implementations, and because the results of these operations are implementation-defined they are reliable if you take care to only use them properly. This is what Mats Petersson is saying: find out the guarantees of the implementation and you'll be fine.

In fact, most consumer platforms you use will handle the std::uintptr_t attempt just fine. If it's not available on your system, or if you want an alternative approach, just combine the hashes of the individual bytes in the pointer. All this needs in order to work is that the unused representation bits always take on the same value. In fact, this is the approach MSVC2012 uses!
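A byte-combining hash might look like the sketch below. It compiles as C++03 and assumes, as discussed, that equal pointers have identical object representations on your platform; that is a property of the implementation, not something the standard promises. The name `hash_pointer_bytes` is made up, and the FNV-1a-style mixing constants are just one common choice, not what any particular library uses.

```cpp
#include <cassert>
#include <cstddef>

// Hashes the bytes of a pointer's object representation.
// Only valid where equal pointer values always have equal bytes.
template <typename T>
std::size_t hash_pointer_bytes(T* ptr)
{
    // Viewing any object as unsigned char is permitted.
    const unsigned char* bytes =
        reinterpret_cast<const unsigned char*>(&ptr);

    std::size_t hash = 2166136261u; // 32-bit FNV offset basis
    for (std::size_t i = 0; i < sizeof(ptr); ++i)
    {
        hash ^= bytes[i];
        hash *= 16777619u;          // 32-bit FNV prime
    }
    return hash;
}

// On a platform that meets the assumption, two pointers to the same
// object produce the same hash.
inline void demo_same_object_same_hash()
{
    int i = 0;
    int* x = &i;
    int* y = &i;
    assert(hash_pointer_bytes(x) == hash_pointer_bytes(y));
}
```

This could be dropped into the body of `myhash<T*>::operator()` as `return hash_pointer_bytes(ptr);`, with the platform assumption documented next to it.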

Had our hypothetical pointer implementation simply always initialized byte2 to a constant, it would work there as well. But there just isn't any requirement for implementations to do so.

Hope this clarifies a few things.

190

answered Nov 15 '22 16:11

GManNickG
