I have a problem in Golang whereby I need to be able to lookup string keys, from about 5,000,000 strings, each of which containing only a-z (lowercase) and 0-9 characters. Similar problem with uint32 and uint64 as keys.
A map (hash table) is perfect for this, but it uses much too much RAM.
There must be known methods for this type of thing, I've been looking into B-Tree but I'm not sure it would be the most efficient mechanism.
Some of the particularities of my problem, which could lead to a more efficient solution, are:
Seeing as it only needs be read-only then it seems to me that having it as a pre-sorted list with a series of indexes, might work well. I thought at first I might be able to just have it in slices with a 36 (26 letters + 10 numbers) index for each level (i.e. character)... but of course that means 36^whatever which ends up being the opposite of efficient. Then I thought maybe I could put only a single index of 36 for each level, but then I end up with a load of arrays/slices that need to be intersected to get the ID of the result.
I guess I'm looking for some kind of very specific B-Tree implementation, but more tuned to my purpose (without the B.)
Does anyone know of anything that exists like I am suggesting?
I'd give a try to a Compressed Trie. It's the data structure perfectly usable in a scenario with lexicographic keys. B-Trees are mostly intended for external memories because they're minimizing the depth of a tree. A trie or a more memory efficient hashing is the right way to go.
You should use a Trie data structure which is designed to map strings to values very efficiently. See https://en.wikipedia.org/wiki/Trie for more information.
A Go based one that is in heavy use as part of the UK Government's new website can be found at https://github.com/alphagov/router/tree/master/trie
I think it depends upon what you are trying to ultimately achieve. For example:
If the answer to question 1 is (a) then a trie is probably a good choice of data structure. If the answer to question 1 is (b) then you are probably best off using a bitset or a bloom filter. Of these two, a bloom filter will be the fastest and most memory efficient but is probabilistic and will yield some false positives (but no true negatives) which may or may not be okay for your use case depending upon your answer to question 2.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With