Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is it possible to map string to int faster than using hashmap?

I understand that I should not optimize every single spot of my program so please consider this question to be "academic"

I have maximum 100 strings and integer number for each of them, something like that:

MSFT 1 DELL 2 HP   4 .... ABC  58 

This set is preinitialized that means that once created it never changes. After set is initialized I use it pretty intensive so it nice to have fast lookup. Strings are pretty short, maximum 30 characters. Mapped int is also limited and between 1 and 100.

At least knowing that strings are preinitialized and never change it should be possible to "find" hash-function that results in "one basket-one item" mapping, but probably there are other hacks.

One optimization I can imagine - i can read first symbol only. For example if "DELL" is the only string starting with "D" and I have received something like "D***" than I do not need even to read the string! it's obviosly "DELL". Such lookup must be significantly faster than "hashmap lookup". (well here I assumed that we receive only symbols that in hash, but it is not always the case)

Are there any ready to use or easy to implement solutions for my problem? i'm using c++ and boost.

upd I've check and found that for my exchange limit for ticker is 12 symbols, not 30 as mentioned above. However other exchanges may allow slighty longer symbols so it's interesting to have algorithm that will continue working on up to 20-charachters long tickers.

like image 982
Oleg Vazhnev Avatar asked Apr 22 '13 06:04

Oleg Vazhnev


2 Answers

A hashtable[1] is in principle the fastest way.

You could however compile a Perfect Hash Function given the fact that you know the full domain ahead of time.

With a perfect hash, there need not be a collision, so you can store the hash table in a linear array!

With proper tweaking you can then

  • fit all of the hash elements in a limited space, making direct addressing a potential option
  • have a reverse lookup in O(1)

The 'old-school' tool for generating Perfect Hash functions would be gperf(1). The wikipedia lists more resources on the subject.

Because of all the debate I ran a demo:

Downloading NASDAQ ticker symbols and getting 100 random samples from that set, applying gperf as follows:

gperf -e ' \015' -L C++ -7 -C -E -k '*,1,$' -m 100 selection > perfhash.cpp 

Results in a hash-value MAX_HASH_VALUE of 157 and a direct string lookup table of as many items. Here's just the hash function for demonstration purposes:

inline unsigned int Perfect_Hash::hash (register const char *str, register unsigned int len) {   static const unsigned char asso_values[] = {       156, 156, 156, 156, 156, 156, 156, 156, 156, 156,       156, 156, 156, 156, 156, 156, 156, 156, 156, 156,       156, 156, 156, 156, 156, 156, 156, 156, 156, 156,       156, 156, 156, 156, 156, 156, 156, 156, 156, 156,       156, 156, 156, 156, 156, 156, 156, 156, 156, 156,       156, 156, 156, 156, 156, 156, 156, 156, 156, 156,       156, 156, 156, 156, 156,  64,  40,   1,  62,   1,        41,  18,  47,   0,   1,  11,  10,  57,  21,   7,        14,  13,  24,   3,  33,  89,  11,   0,  19,   5,        12,   0, 156, 156, 156, 156, 156, 156, 156, 156,       156, 156, 156, 156, 156, 156, 156, 156, 156, 156,       156, 156, 156, 156, 156, 156, 156, 156, 156, 156,       156, 156, 156, 156, 156, 156, 156, 156, 156     };   register int hval = len;    switch (hval) {       default: hval += asso_values[(unsigned char)str[4]];   /*FALLTHROUGH*/       case 4:  hval += asso_values[(unsigned char)str[3]];   /*FALLTHROUGH*/       case 3:  hval += asso_values[(unsigned char)str[2]+1]; /*FALLTHROUGH*/       case 2:  hval += asso_values[(unsigned char)str[1]];   /*FALLTHROUGH*/       case 1:  hval += asso_values[(unsigned char)str[0]];   break;   }   return hval; } 

It really doesn't get much more efficient. Do have a look at the full source at github: https://gist.github.com/sehe/5433535

Mind you, this is a perfect hash, too, so there will be no collisions


Q. [...] it's obviosly "DELL". Such lookup must be significantly faster than "hashmap lookup".

A: If you use a simple std::map the net effect is prefix search (because lexicographical string comparison shortcuts on the first character mismatch). The same thing goes for binary search in a sorted container.


[1]PS. For 100 strings, a sorted array of string with std::search or std::lower_bound would potentially be as fast/faster due to the improved Locality of Reference. Consult your profile results to see whether this applies.

like image 179
sehe Avatar answered Sep 20 '22 08:09

sehe


Small addition to sehe’s post:

If you use a simple std::map the net effect is prefix search (because lexicographical string comparison shortcuts on the first character mismatch). The same thing goes for binary search in a sorted container.

You can harness the prefix search to be much more efficient. The problem with both std::map and naive binary search is that they will read the same prefix redundantly for each individual comparison, making the overall search O(m log n) where m is the length of the search string.

This is the reason why a hashmap outcompetes these two methods for large sets. However, there is a data structure which does not perform redundant prefix comparisons, and in fact needs to compare each prefix exactly once: a prefix (search) tree, more commonly known as trie, and looking up a single string of length m is feasible in O(m), the same asymptotic runtime you get for a hash table with perfect hashing.

Whether a trie or a (direct lookup) hash table with perfect hashing is more efficient for your purpose is a question of profiling.

like image 23
Konrad Rudolph Avatar answered Sep 19 '22 08:09

Konrad Rudolph