I understand that I should not optimize every single spot of my program, so please consider this question "academic".
I have at most 100 strings, each mapped to an integer, something like this:
MSFT 1 DELL 2 HP 4 .... ABC 58
This set is preinitialized, meaning that once created it never changes. After the set is initialized I use it pretty intensively, so a fast lookup is nice to have. The strings are pretty short, at most 30 characters, and the mapped int is limited to the range 1 to 100.
At the very least, knowing that the strings are preinitialized and never change, it should be possible to "find" a hash function that results in a "one bucket, one item" mapping, but probably there are other hacks.
One optimization I can imagine: read the first symbol only. For example, if "DELL" is the only string starting with "D" and I receive something like "D***", then I do not even need to read the rest of the string: it's obviously "DELL". Such a lookup should be significantly faster than a "hashmap lookup". (Here I assumed we only receive symbols that are in the set, but that is not always the case.)
Are there any ready-to-use or easy-to-implement solutions for my problem? I'm using C++ and Boost.
Update: I've checked and found that on my exchange the limit for a ticker is 12 symbols, not 30 as mentioned above. However, other exchanges may allow slightly longer symbols, so it's interesting to have an algorithm that keeps working on tickers up to 20 characters long.
A hash table[1] is in principle the fastest way.
You could, however, compile a perfect hash function, given the fact that you know the full domain ahead of time.
With a perfect hash, there need not be a collision, so you can store the hash table in a linear array! With proper tweaking you can then even make the hash minimal, so the array contains no unused slots.
The 'old-school' tool for generating perfect hash functions is gperf(1). The Wikipedia article on perfect hash functions lists more resources on the subject.
Because of all the debate I ran a demo: I downloaded the NASDAQ ticker symbols, took 100 random samples from that set, and applied gperf as follows:
gperf -e ' \015' -L C++ -7 -C -E -k '*,1,$' -m 100 selection > perfhash.cpp
This results in a hash value MAX_HASH_VALUE of 157 and a direct string lookup table of as many items. Here's just the hash function for demonstration purposes:

```cpp
inline unsigned int
Perfect_Hash::hash (register const char *str, register unsigned int len)
{
  static const unsigned char asso_values[] =
    {
      156, 156, 156, 156, 156, 156, 156, 156, 156, 156, 156, 156, 156,
      156, 156, 156, 156, 156, 156, 156, 156, 156, 156, 156, 156, 156,
      156, 156, 156, 156, 156, 156, 156, 156, 156, 156, 156, 156, 156,
      156, 156, 156, 156, 156, 156, 156, 156, 156, 156, 156, 156, 156,
      156, 156, 156, 156, 156, 156, 156, 156, 156, 156, 156, 156, 156,
       64,  40,   1,  62,   1,  41,  18,  47,   0,   1,  11,  10,  57,
       21,   7,  14,  13,  24,   3,  33,  89,  11,   0,  19,   5,  12,
        0, 156, 156, 156, 156, 156, 156, 156, 156, 156, 156, 156, 156,
      156, 156, 156, 156, 156, 156, 156, 156, 156, 156, 156, 156, 156,
      156, 156, 156, 156, 156, 156, 156, 156, 156, 156, 156, 156
    };
  register int hval = len;

  switch (hval)
    {
      default:
        hval += asso_values[(unsigned char)str[4]];
      /*FALLTHROUGH*/
      case 4:
        hval += asso_values[(unsigned char)str[3]];
      /*FALLTHROUGH*/
      case 3:
        hval += asso_values[(unsigned char)str[2]+1];
      /*FALLTHROUGH*/
      case 2:
        hval += asso_values[(unsigned char)str[1]];
      /*FALLTHROUGH*/
      case 1:
        hval += asso_values[(unsigned char)str[0]];
        break;
    }
  return hval;
}
```
It really doesn't get much more efficient. Do have a look at the full source at github: https://gist.github.com/sehe/5433535
Mind you, this is a perfect hash, too, so there will be no collisions.
Q. [...] it's obviously "DELL". Such a lookup must be significantly faster than a "hashmap lookup".
A: If you use a simple std::map, the net effect is a prefix search (because lexicographical string comparison short-circuits on the first character mismatch). The same goes for binary search in a sorted container.
[1] PS: For 100 strings, a sorted array of strings with std::search or std::lower_bound would potentially be as fast or faster due to the improved locality of reference. Consult your profiling results to see whether this applies.
Small addition to sehe’s post:
"If you use a simple std::map, the net effect is a prefix search (because lexicographical string comparison short-circuits on the first character mismatch). The same goes for binary search in a sorted container."
You can harness the prefix search to be much more efficient. The problem with both std::map and naive binary search is that they read the same prefix redundantly for each individual comparison, making the overall search O(m log n), where m is the length of the search string.
This is the reason why a hashmap outcompetes these two methods for large sets. However, there is a data structure that does not perform redundant prefix comparisons, and in fact needs to compare each prefix exactly once: a prefix (search) tree, more commonly known as a trie. Looking up a single string of length m is feasible in O(m), the same asymptotic runtime you get for a hash table with perfect hashing.
Whether a trie or a (direct lookup) hash table with perfect hashing is more efficient for your purpose is a question of profiling.