Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Go: Efficient string lookup without hash table (aka map)?

Tags:

go

b-tree

I have a problem in Golang whereby I need to be able to lookup string keys, from about 5,000,000 strings, each of which containing only a-z (lowercase) and 0-9 characters. Similar problem with uint32 and uint64 as keys.

A map (hash table) is perfect for this, but it uses much too much RAM.

There must be known methods for this type of thing, I've been looking into B-Tree but I'm not sure it would be the most efficient mechanism.

Some of the particularities of my problem, which could lead to a more efficient solution, are:

  1. The keys need only be strings of a-z0-9 or simple uint values.
  2. Once built it only needs be read-only.

Seeing as it only needs be read-only then it seems to me that having it as a pre-sorted list with a series of indexes, might work well. I thought at first I might be able to just have it in slices with a 36 (26 letters + 10 numbers) index for each level (i.e. character)... but of course that means 36^whatever which ends up being the opposite of efficient. Then I thought maybe I could put only a single index of 36 for each level, but then I end up with a load of arrays/slices that need to be intersected to get the ID of the result.

I guess I'm looking for some kind of very specific B-Tree implementation, but more tuned to my purpose (without the B.)

Does anyone know of anything that exists like I am suggesting?

like image 386
Alasdair Avatar asked Aug 04 '14 18:08

Alasdair


3 Answers

I'd give a try to a Compressed Trie. It's the data structure perfectly usable in a scenario with lexicographic keys. B-Trees are mostly intended for external memories because they're minimizing the depth of a tree. A trie or a more memory efficient hashing is the right way to go.

like image 115
eeq Avatar answered Nov 02 '22 19:11

eeq


You should use a Trie data structure which is designed to map strings to values very efficiently. See https://en.wikipedia.org/wiki/Trie for more information.

A Go based one that is in heavy use as part of the UK Government's new website can be found at https://github.com/alphagov/router/tree/master/trie

like image 24
Ian Davis Avatar answered Nov 02 '22 19:11

Ian Davis


I think it depends upon what you are trying to ultimately achieve. For example:

  1. Do you want to (a) look up some value associated with the string (e.g. an index or cache) or (b) simply test whether the string is a member of a set of strings (e.g. an exclusion list)?
  2. How accurate does the lookup need to be - can you live with false positives?

If the answer to question 1 is (a) then a trie is probably a good choice of data structure. If the answer to question 1 is (b) then you are probably best off using a bitset or a bloom filter. Of these two, a bloom filter will be the fastest and most memory efficient but is probabilistic and will yield some false positives (but no true negatives) which may or may not be okay for your use case depending upon your answer to question 2.

like image 38
James Bowman Avatar answered Nov 02 '22 17:11

James Bowman