When does it make sense to presize a hash?

Tags: hash, perl

From perldata:

You can preallocate space for a hash by assigning to the keys() function.
This rounds up the allocated buckets to the next power of two:

   keys(%users) = 1000;      # allocate 1024 buckets

Is there a rule of thumb for when presizing a hash will improve performance?

Asked Aug 09 '11 by Eugene Yarmash


People also ask

When should a hash table be resized?

A hash table typically needs to be resized once its load factor exceeds about 0.7.

How hash tables grow and shrink?

The usable size of a hash table is the number of associations that the table can hold at a given time. If the number of associations in the table exceeds the usable size, the table will automatically grow, increasing the usable size to a new value that is sufficient to hold the associations.

What is the load factor of a hash table?

The load factor is defined as m/n, where n is the total size (bucket count) of the hash table and m is the number of entries that can be inserted before the underlying data structure needs to grow. For example, 70 entries in a 100-bucket table give a load factor of 0.7.

How do I resize a hash table?

Resizing a hash table consists of choosing a new hash function to map to the new size, creating a hash table of the new size, iterating through the elements of the old table, and inserting them into the new table.
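As an illustration, here is a minimal sketch of that resize step, written against a toy bucket-array table rather than Perl's built-in hashes (which resize themselves automatically); the bucket counts, hash function, and sample data below are illustrative assumptions, not how perl's internals work:

   use strict;
   use warnings;

   # Toy open-hashing table: an array of buckets, each bucket holding [key, value] pairs.
   sub make_table {
       my ($nbuckets) = @_;
       return { nbuckets => $nbuckets, buckets => [ map { [] } 1 .. $nbuckets ] };
   }

   # Illustrative hash function: sum of character codes, taken modulo the bucket count.
   sub bucket_for {
       my ($table, $key) = @_;
       my $sum = 0;
       $sum += ord($_) for split //, $key;
       return $sum % $table->{nbuckets};
   }

   sub insert {
       my ($table, $key, $value) = @_;
       push @{ $table->{buckets}[ bucket_for($table, $key) ] }, [ $key, $value ];
   }

   # The resize itself: build a larger table and re-insert every pair, because
   # each key's bucket index depends on the new bucket count.
   sub resize {
       my ($old, $new_nbuckets) = @_;
       my $new = make_table($new_nbuckets);
       for my $bucket (@{ $old->{buckets} }) {
           insert($new, @$_) for @$bucket;
       }
       return $new;
   }

   my $table = make_table(8);
   insert($table, $_, uc $_) for 'aa' .. 'zz';
   $table = resize($table, 16);    # double the bucket count and rehash everything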


3 Answers

The rule of thumb is that the larger you know the hash will be, the more likely you are to get value out of pre-sizing it. Consider a hash with 10 slots where you add elements one after the other: the expansions will a) be few (if any), and b) small (since there is little data to copy).

But if you KNOW you're going to need at least 1M items, then there's no reason to expand and copy the underlying, ever-growing data structures over and over while the table grows.

Will you NOTICE this expansion? Eh, maybe. Modern machines are pretty darn fast; it may not come up. But it's a grand opportunity for heap expansion, and thus for a GC and a cascade of all sorts of things. So, if you know you're going to use it, it's a "cheap" fix to tweak out a few more milibleems of performance.
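
A hedged sketch of that "cheap" fix: if you already know roughly how many items are coming, presize once up front. The item count is made up for illustration, and get_next_record() is a hypothetical input source standing in for whatever data you are actually loading:

   use strict;
   use warnings;

   my %seen;

   # One allocation up front (rounded up to 2**20 = 1_048_576 buckets) instead of
   # repeated doubling-and-copying while the hash grows.
   keys(%seen) = 1_000_000;

   # get_next_record() is hypothetical; substitute your real data source.
   while (my $record = get_next_record()) {
       $seen{ $record->{id} } = $record;
   }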

Answered Sep 30 '22 by Will Hartung


I tried to benchmark the expansion cost as the hash grows:

use Benchmark qw(cmpthese);

# few values
cmpthese(-4, {
    prealloc => sub {
        my %hash;
        keys(%hash) = 17576;    # 26**3, the number of keys in 'aaa' .. 'zzz'
        $hash{$_} = $_ for 'aaa' .. 'zzz';
    },
    normal   => sub {
        my %hash;
        $hash{$_} = $_ for 'aaa' .. 'zzz';
    },
});

# more values
cmpthese(-8, {
    prealloc => sub {
        my %hash;
        keys(%hash) = 456976;   # 26**4, the number of keys in 'aaaa' .. 'zzzz'
        $hash{$_} = $_ for 'aaaa' .. 'zzzz';
    },
    normal   => sub {
        my %hash;
        $hash{$_} = $_ for 'aaaa' .. 'zzzz';
    },
});

The results do not suggest a big optimization, but reducing the heap fragmentation mentioned by Will Hartung might be a benefit. This was run with perl 5.12 on a WinXP machine.

       Rate   normal prealloc
normal   48.3/s       --      -2%
prealloc 49.4/s       2%       --
        (warning: too few iterations for a reliable count)
     s/iter   normal prealloc
normal     3.62       --      -1%
prealloc   3.57       1%       --
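
If you want to double-check that the preallocation actually took effect, the bucket counts can be inspected. A small sketch, with the caveat that this assumes perl 5.26 or later, where Hash::Util exposes the bucket statistics; on older perls, evaluating the hash in scalar context reported a used/total bucket ratio instead:

   use strict;
   use warnings;

   # Hash::Util's bucket introspection functions ship with perl 5.26+.
   use Hash::Util qw(num_buckets used_buckets);

   my %hash;
   keys(%hash) = 17_576;     # request 26**3 buckets; perl rounds up to 32_768
   $hash{$_} = $_ for 'aaa' .. 'zzz';

   printf "%d of %d buckets used\n", used_buckets(%hash), num_buckets(%hash);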

Answered Sep 30 '22 by bvr


Basically, this is the door to optimizing hash performance. Hash performance depends heavily both on the hashing algorithm used and on the data you are handling, so it is almost impossible to come up with rules of thumb. Anyway, something can be said.

You know that each data structure offers a given balance between space and time efficiency. Hash tables are especially good in terms of time efficiency, offering an appealing constant (O(1)) access time.

This holds true unless there is a collision. When a collision happens, the access time is linear in the size of the bucket corresponding to the collision value. (Have a look at this for more details). Collisions, apart from being "slower", are mostly a disruption of the access-time guarantee, which is often the single most important reason for choosing a hash table in the first place.

Ideally, hash tables could aim at what is known as "perfect hashing" (which is actually feasible only when you can fine-tune the algorithm to the kind of data you will handle), but this is not so easy to attain in the general case (that is a euphemism, actually). Anyway, it is a matter of fact that bigger hash tables (together with a good hashing algorithm) can reduce the frequency of collisions and thus improve performance, at the expense of memory. Smaller hash tables will see more collisions (hence lower performance and a weaker access-time guarantee) but occupy less memory.

So, if you profile your program and see that hash table access is a bottleneck (for whatever reason), you have a chance to solve this by reserving more memory for the hash space (if you have memory to give).

In any case, I would not increase this value at random, but only after thorough profiling, since it is also true that the hashing algorithm perl uses is compiled in (AFAIK), and this also has a big effect on hash performance (in other words, you could still have a lot of collisions even if you make the hash space bigger).

As usual with performance-related things, it may or may not be useful; it depends on your concrete case.

Answered Sep 30 '22 by sergio