I've been writing an image processing algorithm, and at some point I needed to collect statistical information about the transformed pixels to gain more insight into the direction I should follow with further development. The information I needed to collect was in the format:
key: RGB value
value: int
What I did was open the transformed image and iterate through it, saving the values I needed to a std::unordered_map that has the following signature:
typedef std::unordered_map<boost::gil::rgb8_pixel_t, unsigned int> pixel_map_t;
In a loop:
for(int y = 0; y < vi.height(); y++) {
    SrcView::x_iterator dst_it = src.row_begin(y);
    for(int x = 0; x < vi.width(); x++, hits++) {
        diff_map.insert(std::make_pair(dst_it[x], /* some uint32 */));
    }
}
I also wrote a custom hash function (a perfect hash: 256^2 * R + 256 * G + B), so collisions should be minimal regardless of the bucket count and the hash table's layout (to a reasonable extent).
What I noticed was that the insertion was miserably slow: after reaching the 11th iteration, insertion speed degraded by about 100x, and I had a massive number of collisions, despite the very small number of duplicated colors in the image.
Afterwards, I wanted to eliminate any possible fault in my own code, so I started benchmarking the unordered_map using the STL hash functions with primitive key types such as int.
The code for the benchmark was:
std::size_t hits = 0, colls = 0;
for(int y = 0; y < vi.height(); y++) {
    SrcView::x_iterator dst_it = src.row_begin(y);
    for(int x = 0; x < vi.width(); x++, hits++) {
        if(diff_map.find(x*y) != diff_map.cend())
            colls++;
        diff_map.insert(std::make_pair(x*y, 10));
    }
    std::cout << y << "/" << vi.height() << " -> buckets: "
              << diff_map.bucket_count() << "("
              << std::floor(diff_map.load_factor() * 100)
              << "% load factor) [ " << colls << " collisions / " << hits << " hits ]" << std::endl;
}
... and here are the results for the first 22 iterations of the outer loop (using only the STL hash function for the int-typed key):
0/480 -> buckets: 8(12% load factor) [ 639 collisions / 640 hits ]
1/480 -> buckets: 4096(15% load factor) [ 640 collisions / 1280 hits ]
2/480 -> buckets: 4096(23% load factor) [ 960 collisions / 1920 hits ]
3/480 -> buckets: 4096(31% load factor) [ 1281 collisions / 2560 hits ]
4/480 -> buckets: 4096(37% load factor) [ 1654 collisions / 3200 hits ]
5/480 -> buckets: 4096(45% load factor) [ 1964 collisions / 3840 hits ]
6/480 -> buckets: 4096(51% load factor) [ 2370 collisions / 4480 hits ]
7/480 -> buckets: 4096(59% load factor) [ 2674 collisions / 5120 hits ]
8/480 -> buckets: 4096(65% load factor) [ 3083 collisions / 5760 hits ]
9/480 -> buckets: 4096(71% load factor) [ 3460 collisions / 6400 hits ]
10/480 -> buckets: 4096(77% load factor) [ 3872 collisions / 7040 hits ]
11/480 -> buckets: 4096(85% load factor) [ 4161 collisions / 7680 hits ]
12/480 -> buckets: 4096(90% load factor) [ 4612 collisions / 8320 hits ]
13/480 -> buckets: 4096(99% load factor) [ 4901 collisions / 8960 hits ]
14/480 -> buckets: 32768(13% load factor) [ 5315 collisions / 9600 hits ]
15/480 -> buckets: 32768(13% load factor) [ 5719 collisions / 10240 hits ]
16/480 -> buckets: 32768(14% load factor) [ 6148 collisions / 10880 hits ]
17/480 -> buckets: 32768(15% load factor) [ 6420 collisions / 11520 hits ]
18/480 -> buckets: 32768(16% load factor) [ 6870 collisions / 12160 hits ]
19/480 -> buckets: 32768(17% load factor) [ 7135 collisions / 12800 hits ]
20/480 -> buckets: 32768(17% load factor) [ 7584 collisions / 13440 hits ]
21/480 -> buckets: 32768(18% load factor) [ 7993 collisions / 14080 hits ]
Isn't the number of collisions too high in this case? STL libraries are generally of high quality, but having 639/640 and 640/1280 for a simple int-based key sounds weird at the very least. Or maybe I'm doing something wrong?
And this was my hash function (which, in theory, should have no collisions at all, yet the numbers were nearly identical):
template<>
struct std::hash<boost::gil::rgb8_pixel_t> :
    public std::unary_function<const boost::gil::rgb8_pixel_t&, size_t>
{
    size_t operator()(const boost::gil::rgb8_pixel_t& key) const
    {
        size_t ret = (static_cast<size_t>(key[0]) << 16) |
                     (static_cast<size_t>(key[1]) << 8) |
                     (static_cast<size_t>(key[2]));
        //return 256 * 256 * key[0] + 256 * key[1] + key[2];
        return ret;
    }
};
Now, this is not funny anymore...
I wrote this hash function:
template<>
struct std::hash<int> :
    public std::unary_function<const int&, size_t>
{
    size_t operator()(const int& key) const
    {
        return 5;
    }
};
In theory, I should have a 100% collision rate, right? But the results are:
0/480 -> buckets: 8(12% load factor) [ 639 collisions / 640 hits ]
1/480 -> buckets: 4096(15% load factor) [ 640 collisions / 1280 hits ]
2/480 -> buckets: 4096(23% load factor) [ 960 collisions / 1920 hits ]
3/480 -> buckets: 4096(31% load factor) [ 1281 collisions / 2560 hits ]
4/480 -> buckets: 4096(37% load factor) [ 1654 collisions / 3200 hits ]
5/480 -> buckets: 4096(45% load factor) [ 1964 collisions / 3840 hits ]
6/480 -> buckets: 4096(51% load factor) [ 2370 collisions / 4480 hits ]
7/480 -> buckets: 4096(59% load factor) [ 2674 collisions / 5120 hits ]
8/480 -> buckets: 4096(65% load factor) [ 3083 collisions / 5760 hits ]
9/480 -> buckets: 4096(71% load factor) [ 3460 collisions / 6400 hits ]
Why?
Env: MSVS2010
unordered_map IS a hash table, actually. You may want to use a better example, since, as-is, the second insert will fail because it has the same key.
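(For illustration, a minimal sketch of what this comment refers to, using a plain int-keyed map: std::unordered_map::insert returns a pair whose bool member reports whether the insertion actually took place, so a second insert with an already-present key is a no-op.)

#include <iostream>
#include <unordered_map>

int main()
{
    std::unordered_map<int, int> m;

    // First insert succeeds: key 42 is not present yet.
    auto r1 = m.insert(std::make_pair(42, 10));
    std::cout << "first insert happened: " << r1.second << '\n';   // prints 1

    // Second insert with the same key is rejected; the stored
    // value remains 10 and r2.second is false.
    auto r2 = m.insert(std::make_pair(42, 99));
    std::cout << "second insert happened: " << r2.second << '\n';  // prints 0
    std::cout << "value for key 42: " << m[42] << '\n';            // prints 10
}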
colls is not measuring collisions. If you want to measure collisions, then for each bucket b in the range [0, bucket_count()), get bucket_size(b). That will tell you how many items are in each bucket. If there are 2 or more items in the bucket, then you have bucket_size(b) - 1 collisions for bucket b.
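A minimal sketch of that measurement, reusing the name diff_map from the question (any std::unordered_map works the same way):

#include <cstddef>
#include <iostream>
#include <unordered_map>

// Count real collisions: every element beyond the first one in a
// bucket shares that bucket with another element.
template <typename Map>
std::size_t count_collisions(const Map& m)
{
    std::size_t collisions = 0;
    for (std::size_t b = 0; b < m.bucket_count(); ++b)
    {
        const std::size_t size = m.bucket_size(b);
        if (size > 1)
            collisions += size - 1;
    }
    return collisions;
}

int main()
{
    std::unordered_map<int, int> diff_map;
    for (int i = 0; i < 1000; ++i)
        diff_map.insert(std::make_pair(i, i));

    std::cout << "collisions: " << count_collisions(diff_map) << '\n';
}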