Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

tr1::hash for boost::thread::id?

I started to use the unordered_set class from the tr1 namespace to speed-up access against the plain (tree-based) STL map. However, I wanted to store references to threads ID in boost (boost::thread::id), and realized that the API of those identifiers is so opaque that you cannot clearly obtain a hash of it.

Surprisingly, boost implements parts of the tr1 (including hash and unordered_set), but it does not define a hash class that is able to hash a thread ID.

Looking at the documentation of boost::thread::id I found that thread IDs can be output to a stream, so my solution for doing hashing was kind of:

struct boost_thread_id_hash
{
    size_t operator()(boost::thread::id const& id) const
    {
        std::stringstream ostr;
        ostr << id;
        std::tr1::hash<std::string> h;
        return h(ostr.str());
    }
};

That is, serialize it, apply the hash to the resulting string. However, this seems to be less efficient than actually using the STL map<boost::thread::id>.

So, my questions: Do you find a better way of doing this? Is it a clear inconsistency in both boost and tr1 not to force the existence of a hash<boost::thread::id> class?

Thanks.

like image 717
Diego Sevilla Avatar asked Apr 21 '09 11:04

Diego Sevilla


4 Answers

The overhead of stringifying thread::id (only to compute the string hash afterward) is, as you almost said yourself, astronomical compared to any performance benefits a tr1::unordered_map might confer vis-a-vis std::map. So the short answer would be: stick with std::map< thread::id, ... >

If you absolutely must use unordered containers, try to usenative_handle_type instead of thread::id if possible, i.e. prefer tr1::unordered_map< thread::native_handle_type, ... >, invoking thread::native_handle() instead of thread::get_id() when inserting and finding.

DO NOT attempt anything like the following:

struct boost_thread_id_hash {
   // one and only member of boost::thread::id is boost::thread::id::thread_data
   //   of type boost::detail::thread_data_ptr;
   // boost::thread::id::operator==(const id&) compares boost::thread::id::thread_data's
   size_t operator()(boost::thread::id const& id) const {
      const boost::detail::thread_data_ptr* pptdp = \
        reinterpret_cast< boost::detail::thread_data_ptr* >(&id);
      return h(pptdp->get());
   }
};

It could work, but is extremely brittle and an almost guaranteed timebomb. It assumes intimate knowledge of the inner workings of the thread::id implementation. It will get you cursed at by other developers. Don't do it if maintainability is of any concern! Even patching boost/thread/detail/thread.hpp to add size_t hash_value(const id& tid) as a friend of thread::id is "better". :)

like image 100
vladr Avatar answered Oct 20 '22 13:10

vladr


Some years late to answer this question, but this showed up as the most relevant one when trying to put a boost::thread::id in a std::unordered_map as key. Getting the native handle was a good suggestion in the accepted reply except that it is not available for this_thread.

Instead boost for sometime has a hash_value for thread::id, so this worked fine for me:

namespace boost {
  extern std::size_t hash_value(const thread::id &v);
}

namespace std {
  template<>
  struct hash<boost::thread::id> {
    std::size_t operator()(const boost::thread::id& v) const {
      return boost::hash_value(v);
    }
  };
}

Of course, need to link against libboost_thread library.

like image 37
Sumedh Avatar answered Oct 20 '22 14:10

Sumedh


The obvious question is why would you want to actually use a hash ?

I understand the issue with map / set for performance critical code, indeed those containers are not very cache friendly because the items might be allocated at very different memory locations.

As KeithB suggested (I won't comment on using the binary representation since nothing guarantees that 2 ids have the same binary representation after all...), using a sorted vector can speed up the code in case there is very few items.

Sorted vectors / deques are much more cache-friendly, however they suffer from a O(N) complexity on insert/erase because of the copying involved. Once you reach a couple hundreds threads (never seen that many by the way), it could hurt.

There is however, a data structure that tries to associate the benefits from maps and sorted vectors: the B+Tree.

You can view it as a map for which each node would contain more than one element (in sorted order). Only the leaf nodes are used.

To get some more performance you can:

  • Link the leaves linearly: ie the root caches a pointer to the first and last leaf and the leaves are interconnected themselves, so that linear travel completely bypass the interal nodes.
  • Cache the last accessed leaf in the root, after all it's likely that'll also be the next one accessed.

The asymptotic performances are the same than for the map, because it's implemented as a Balanced Binary Tree, but because the values are packed in groups, you're code can become faster by a constant.

The real difficulty is to tailor the size of each "bucket", you'll need some profiling for that so it would be better if your implementation allowed some customization there (as it will depend on the architecture on which the code is executed).

like image 3
Matthieu M. Avatar answered Oct 20 '22 13:10

Matthieu M.


Why do you want to store these in a set. Unless you doing something out of the ordinary, there will be a small number of threads. The overhead of maintaining a set is probably higher than just putting them in a vector and doing a linear search.

If searching will happen more frequently than adding and deleting, you can just use a sorted vector. There is a < operator defined for boost::thread::id, so you can sort the vector (or insert into the correct place) after each addition or deletion, and use lower_bound() to do a binary search. This is the same complexity as searching a set, and should have lower overhead for small amounts of data.

If you still need to do this, how about just treating it as a sizeof(boost::thread:id) bytes, and operating on those.

This example assumes that the size of boost::thread::id is a multiple of the size of an int, and that there is no packing, and no virtual functions. If that is not true, it will have to be modified, or will not work at all.

EDIT: I took a look at the boost::thread::id class, and it has a boost::shared_pointer<> as a member, so the code below is horribly broken. I think the only solution is to have the authors of boost::thread add a hash function. I'm leaving the example just in case its useful in some other context.

boost::thread::id id;
unsigned* data;
// The next line doesn't do anything useful in this case.
data = reinterpret_cast<unsigned *>(&id);
unsigned hash = 0;

for (unsigned int i = 0; i < sizeof(boost::thread::id)/4; i++)
  hash ^= data[i];
like image 2
KeithB Avatar answered Oct 20 '22 13:10

KeithB