Template code bloat with unordered_map

Question

I wondered if unordered_map is implemented using type erasure, since an unordered_map<Key, A*> and unordered_map<Key, B*> can use exactly the same code (apart from casting, which is a no-op in machine code). That is, the implementation of both could be based on unordered_map<Key, void*> to save code size.

Update: This technique is commonly referred to as the Thin Template Idiom (Thanks to the commenters below for pointing that out).

Update 2: I would be particlarly interested in Howard Hinnant's opinion. Let's hope he reads this.

So I wrote this small test:

#include <iostream>

#if BOOST
# include <boost/unordered_map.hpp>
  using boost::unordered_map;
#else
# include <unordered_map>
  using std::unordered_map;
#endif

struct A { A(int x) : x(x) {} int x; };
struct B { B(int x) : x(x) {} int x; };

int main()
{
#if SMALL
    unordered_map<std::string, void*> ma, mb;
#else
    unordered_map<std::string, A*> ma;
    unordered_map<std::string, B*> mb;
#endif

    ma["foo"] = new A(1);
    mb["bar"] = new B(2);

    std::cout << ((A*) ma["foo"])->x << std::endl;
    std::cout << ((B*) mb["bar"])->x << std::endl;

    // yes, it leaks.
}

And determined the size of the compiled output with various settings:

#!/bin/sh

for BOOST in 0 1 ; do
    for OPT in 2 3 s ; do
        for SMALL in 0 1 ; do
            clang++ -stdlib=libc++ -O${OPT} -DSMALL=${SMALL} -DBOOST=${BOOST} map_test.cpp -o map_test
            strip map_test
            SIZE=$(echo "scale=1;$(stat -f "%z" map_test)/1024" | bc)
            echo boost=$BOOST opt=$OPT small=$SMALL size=${SIZE}K
        done
    done
done

It turns out, that with all settings I tried, lots of inner code of unordered_map seems to be instantiated twice:

With Clang and libc++:
          |   -O2   |   -O3   |   -Os
-DSMALL=0 |  24.7K  |  23.5K  |  28.2K
-DSMALL=1 |  17.9K  |  17.2K  |  19.8K


With Clang and Boost:
          |   -O2   |   -O3   |   -Os
-DSMALL=0 |  23.9K  |  23.9K  |  32.5K
-DSMALL=1 |  17.4K  |  17.4K  |  22.3K


With GCC and Boost:
          |   -O2   |   -O3   |   -Os
-DSMALL=0 |  21.8K  |  21.8K  |  35.5K
-DSMALL=1 |  16.4K  |  16.4K  |  26.2K

(With the compilers from Apple's Xcode)

Now to the question: Is there some convincing technical reason due to which the implementers have chosen to omit this simple optimization?

Also: why the hell is the effect of -Os exactly the opposite of what is advertised?

Update 3:

As suggested by Nicol Bolas, I have repeated the measurements with shared_ptr<void/A/B> instead of naked pointers (created with make_shared and cast with static_pointer_cast). The tendency in the results is the same:

With Clang and libc++:
          |   -O2   |   -O3   |   -Os
-DSMALL=0 |  27.9K  |  26.7K  |  30.9K
-DSMALL=1 |  25.0K  |  20.3K  |  26.8K


With Clang and Boost:
          |   -O2   |   -O3   |   -Os
-DSMALL=0 |  35.3K  |  34.3K  |  43.1K
-DSMALL=1 |  27.8K  |  26.8K  |  32.6K

Howard Hinnant · Accepted Answer

Since I've been specifically asked to comment, I will, though I'm not sure I have much more to add than has already been said. (sorry it took me 8 days to get here)

I've implemented the thin template idiom before, for some containers, namely vector, deque and list. I don't currently have it implemented for any container in libc++. And I've never implemented it for the unordered containers.

It does save on code size. It also adds complexity, much more so than the referenced wikibooks link implies. One can also do it for more than just pointers. You can do it for all scalars which have the same size. For example why have different instantiations for int and unsigned? Even ptrdiff_t can be stored in the same instantiation as T*. After all, it is all just a bag bits at the bottom. But it is extremely tricky to get the member templates which take a range of iterators correct when playing these tricks.

There are disadvantages though (besides difficulty of implementation). It doesn't play nearly as nicely with the debugger. At the very least it makes it much more difficult for the debugger to display container innards. And while the code size savings can be significant, I would stop short of calling the code size savings dramatic. Especially when compared to the memory required to store the photographs, animations, audio clips, street maps, years of email with all of the attachments from your best friends and family, etc. I.e. optimizing code size is important. But you should take into account that in many apps today (even on embedded devices), if you cut your code size in half, you might cut your app size by 5% (statistics admittedly pulled from thin air).

My current position is that this particular optimization is one best paid for and implemented in the linker instead of in the template container. Though I know this isn't easy to implement in the linker, I have heard of successful implementations.

That being said, I still do try to make code size optimizations in templates. For example in the libc++ helper structures such as __hash_map_node_destructor are templated on as few parameters as possible, so if any of their code gets outlined, it is more likely that one instantiation of the helper can serve more than one instantiation of unordered_map. This technique is debugger friendly, and not that hard to get right. And can even have some positive side effects for the client when applied to iterators (N2980).

In summary, I wouldn't hold it against code for going the extra mile and implementing this optimization. But I also wouldn't classify it as high a priority as I did a decade ago, both because linker technology has progressed, and the ratio of code size to application size has tended to fairly dramatically decrease.

Template code bloat with unordered_map

Tags:

c++

boost

unordered-map

type-erasure

libc++

marton78

1 Answers

Howard Hinnant

Recent Activity

Donate For Us

Template code bloat with unordered_map

Tags:

c++

boost

unordered-map

type-erasure

libc++

marton78

1 Answers

Howard Hinnant

Related questions

Recent Activity

Donate For Us