Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Data structure for fixed length string lookup

I have a bunch of strings as keys. Something like...

AAAA ABBA ACEA ALFG
...
...
ZURF [AAA _JFS aKDJ

They are all unique combination of any 4 characters and are all the same length. There are hundreds of thousands of these. I want to perform a lookup and retrieve the value associated with each string.

I currently have it implemented as a hash table, but the main concern is collisions (I've implemented all of the strategies on Wiki).

I am thinking of implementing this as a prefix tree. Given the parameters though (unique, fixed length), I'm wondering if there is a out-of-the-box data structure I can't think of that would be best suited for this...

EDIT: Additionally, all possible combinations are populated once by a data file. Afterwards, lookups happen at wire speed.

like image 766
user962158 Avatar asked Oct 13 '11 18:10

user962158


1 Answers

Since you know all of the strings ahead of time, you can use gperf to generate a perfect hash function, which has no collisions. For example, with the four input strings AAAA ABBA ACEA ALFG, it generated the following hash function (using the command line gperf -L ANSI-C input.txt):

static unsigned int
hash (register const char *str, register unsigned int len)
{
  static unsigned char asso_values[] =
    {
      12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
      12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
      12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
      12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
      12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
      12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
      12, 12, 12, 12, 12,  7,  2,  5, 12, 12,
      12, 12, 12, 12, 12, 12,  0, 12, 12, 12,
      12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
      12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
      12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
      12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
      12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
      12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
      12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
      12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
      12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
      12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
      12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
      12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
      12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
      12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
      12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
      12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
      12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
      12, 12, 12, 12, 12, 12
    };
  return len + asso_values[(unsigned char)str[1]];
}

const char *
in_word_set (register const char *str, register unsigned int len)
{
  static const char * wordlist[] =
    {
      "", "", "", "",
      "ALFG",
      "",
      "ABBA",
      "", "",
      "ACEA",
      "",
      "AAAA"
    };

  if (len <= MAX_WORD_LENGTH && len >= MIN_WORD_LENGTH)
    {
      register int key = hash (str, len);

      if (key <= MAX_HASH_VALUE && key >= 0)
        {
          register const char *s = wordlist[key];

          if (*str == *s && !strcmp (str + 1, s + 1))
            return s;
        }
    }
  return 0;
}

Which requires a single table lookup, a length comparison, and a string comparison. If you know for sure that the word you're hashing is one of your source words, then you can skip the string comparison.

Expanding the input size from 4 to 10000 randomly-generated strings increases the hash function to just 4 table lookups plus a length comparison and string comparison. But, since the string comparison has to store every source string in it, this comes out to a very large table in the compiled object file (1.4 MB). If you don't need to do the string comparison, you can omit that table.

like image 109
Adam Rosenfield Avatar answered Sep 21 '22 21:09

Adam Rosenfield