Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is it possible to hash pointers in portable C++03 code?

Is it possible to portably hash a pointer in C++03, which does not have std::hash defined?

It seems really weird for hashables containing pointers to be impossible in C++, but I can't think of any way of making them.

The closest way I can think of is doing reinterpret_cast<uintptr_t>(ptr), but uintptr_t is not required to be defined in C++03, and I'm not sure if the value could be legally manipulated even if it was defined... is this even possible?

like image 278
user541686 Avatar asked Jan 05 '13 01:01

user541686


People also ask

Can you hash a pointer?

Sometimes you need to take a hash function of a pointer; not the object the pointer points to, but the pointer itself. Lots of the time, folks just punt and use the pointer value as an integer, chop off some high bits to make it fit, maybe shift out known-zero bits at the bottom.

Can you hash a pointer in C++?

you probably mean std::unordered_map and the answer is practicaly the same: you can use pointers if you really mean to hash pointers (and not the object they point to). You can as well implement your own hash (or hash-redirection) to work with the pointed-to objects.

What is hash in Cpp?

A Hash Table in C/C++ (Associative array) is a data structure that maps keys to values. This uses a hash function to compute indexes for a key. Based on the Hash Table index, we can store the value at the appropriate location.


1 Answers

No, in general. In fact it's not even possible in general in C++11 without std::hash.

The reason why lies in the difference between values and value representations.

You may recall the very common example used to demonstrate the different between a value and its representation: the null pointer value. Many people mistakenly assume that the representation for this value is all bits zero. This is not guaranteed in any fashion. You are guaranteed behavior by its value only.

For another example, consider:

int i;
int* x = &i;
int* y = &i;

x == y;  // this is true; the two pointer values are equal

Underneath that, though, the value representation for x and y could be different!

Let's play compiler. We'll implement the value representation for pointers. Let's say we need (for hypothetical architecture reasons) the pointers to be at least two bytes, but only one is used for the value.

I'll just jump ahead and say it could be something like this:

struct __pointer_impl
{
    std::uint8_t byte1; // contains the address we're holding
    std::uint8_t byte2; // needed for architecture reasons, unused
    // (assume no padding; we are the compiler, after all)
};

Okay, this is our value representation, now lets implement the value semantics. First, equality:

bool operator==(const __pointer_impl& first, const __pointer_impl& second)
{
    return first.byte1 == second.byte1;
}

Because the pointer's value is really only contained in the first byte (even though its representation has two bytes), that's all we have to compare. The second byte is irrelevant, even if they differ.

We need the address-of operator implementation, of course:

__pointer_impl address_of(int& i)
{
    __pointer_impl result;

    result.byte1 = /* hypothetical architecture magic */;

    return result;
}

This particular implementation overload gets us a pointer value representation for a given int. Note that the second byte is left uninitialized! That's okay: it's not important for the value.

This is really all we need to drive the point home. Pretend the rest of the implementation is done. :)

So now consider our first example again, "compiler-ized":

int i;

/* int* x = &i; */
__pointer_impl x = __address_of(i);

/* int* y = &i; */
__pointer_impl y = __address_of(i);

x == y;  // this is true; the two pointer values are equal

For our tiny example on the hypothetical architecture, this sufficiently provides the guarantees required by the standard for pointer values. But note you are never guaranteed that x == y implies memcmp(&x, &y, sizeof(__pointer_impl)) == 0. There simply aren't requirements on the value representation to do so.

Now consider your question: how do we hash pointers? That is, we want to implement:

template <typename T>
struct myhash;

template <typename T>
struct myhash<T*> :
    std::unary_function<T*, std::size_t>
{
    std::size_t operator()(T* const ptr) const
    {
        return /* ??? */;
    }
};

The most important requirement is that if x == y, then myhash()(x) == myhash()(y). We also already know how to hash integers. What can we do?

The only thing we can do is try to is somehow convert the pointer to an integer. Well, C++11 gives us std::uintptr_t, so we can do this, right?

return myhash<std::uintptr_t>()(reinterpret_cast<std::uintptr_t>(ptr));

Perhaps surprisingly, this is not correct. To understand why, imagine again we're implementing it:

// okay because we assumed no padding:
typedef std::uint16_t __uintptr_t; // will be used for std::uintptr_t implementation

__uintptr_t __to_integer(const __pointer_impl& ptr)
{
    __uintptr_t result;
    std::memcpy(&result, &ptr, sizeof(__uintptr_t));

    return result;
}

__pointer_impl __from_integer(const __uintptr_t& ptrint)
{
    __pointer_impl result;
    std::memcpy(&result, &ptrint, sizeof(__pointer_impl));

    return result;
}

So when we reinterpret_cast a pointer to integer, we'll use __to_integer, and going back we'll use __from_integer. Note that the resulting integer will have a value depending upon the bits in the value representation of pointers. That is, two equal pointer values could end up with different integer representations...and this is allowed!

This is allowed because the result of reinterpret_cast is totally implementation-defined; you're only guaranteed the resulting of the opposite reinterpret_cast gives you back the same result.

So there's the first issue: on this implementation, our hash could end up different for equal pointer values.

This idea is out. Maybe we can reach into the representation itself and hash the bytes together. But this obviously ends up with the same issue, which is what the comments on your question are alluding to. Those pesky unused representation bits are always in the way, and there's no way to figure out where they are so we can ignore them.

We're stuck! It's just not possible. In general.

Remember, in practice we compile for certain implementations, and because the results of these operations are implementation-defined they are reliable if you take care to only use them properly. This is what Mats Petersson is saying: find out the guarantees of the implementation and you'll be fine.

In fact, most consumer platforms you use will handle the std::uintptr_t attempt just fine. If it's not available on your system, or if you want an alternative approach, just combine the hashes of the individual bytes in the pointer. All this requires to work is that the unused representation bits always take on the same value. In fact, this is the approach MSVC2012 uses!

Had our hypothetical pointer implementation simply always initialized byte2 to a constant, it would work there as well. But there just isn't any requirement for implementations to do so.

Hope this clarifies a few things.

like image 190
GManNickG Avatar answered Nov 15 '22 16:11

GManNickG