Does anyone know how I might scale a Trie across multiple machines? Say the first machine runs out of space and I need to add more words from a very large dictionary; what could I do to add them? (I am a Java thinker, but I believe the answer can be language-agnostic.) I have already realized that I could just use one machine for each first character, but that doesn't really scale.
The idea is to start traversing from the root node of the trie. Whenever we find a non-null child node, we add the key of that child node to the string str at the current index (level) and then recursively repeat the same process for the child node. This continues until we reach a node that is a leaf node, which ...
In a trie indexing an alphabet of 26 letters, each node has 26 possible children and, therefore, 26 possible pointers.
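The node layout and the recursive traversal described above can be sketched in Java as follows. This is a minimal illustration, not a reference implementation: the names `SimpleTrie`, `Node`, `insert`, and `words` are made up for this example, it assumes lowercase a-z input only, and it records a word at every end-of-word marker rather than only at leaf nodes, so that a word which is a prefix of another word is not lost.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal trie over the 26 lowercase letters a-z.
class SimpleTrie {
    private static final int ALPHABET = 26;

    private static final class Node {
        // One slot per letter; null means "no child for this letter".
        final Node[] children = new Node[ALPHABET];
        boolean endOfWord;
    }

    private final Node root = new Node();

    // Assumes the word contains only the characters 'a'..'z'.
    void insert(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            int i = c - 'a';
            if (node.children[i] == null) {
                node.children[i] = new Node();
            }
            node = node.children[i];
        }
        node.endOfWord = true;
    }

    // Depth-first traversal from the root, collecting every stored word.
    List<String> words() {
        List<String> out = new ArrayList<>();
        collect(root, new StringBuilder(), out);
        return out;
    }

    private void collect(Node node, StringBuilder prefix, List<String> out) {
        if (node.endOfWord) {
            out.add(prefix.toString());
        }
        for (int i = 0; i < ALPHABET; i++) {
            if (node.children[i] != null) {
                prefix.append((char) ('a' + i));        // descend one level
                collect(node.children[i], prefix, out);
                prefix.setLength(prefix.length() - 1);  // backtrack
            }
        }
    }

    public static void main(String[] args) {
        SimpleTrie trie = new SimpleTrie();
        for (String w : new String[] {"cat", "car", "cart", "dog"}) {
            trie.insert(w);
        }
        System.out.println(trie.words());
    }
}
```

Because the children are visited in index order, the words come back in alphabetical order. The fixed 26-slot array is simple but wastes memory on sparse nodes, which is part of why a large dictionary can outgrow a single machine.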
Two options are available to store the data: Document store: Since a new trie is built weekly, we can periodically take a snapshot of it, serialize it, and store the serialized data in the database. Document stores like MongoDB [4] are good fits for serialized data.
A trie is a tree-like data structure whose nodes store the letters of an alphabet. By structuring the nodes in a particular way, words and strings can be retrieved from the structure by traversing down a branch path of the tree. Tries in the context of computer science are a relatively new thing.
OK, assuming that both of your machines have the same resources available, let's first look at a simpler example: how would you scale a binary tree? Or, even better, an AVL tree? There are several ways to do this:
(Note that balancing such a distributed tree is much more complicated, because you need to communicate with other machines, possibly inside a distributed transaction, to be able to answer all requests concurrently.)
So, now a trie, which, AFAIR, is essentially one tree per letter. If the first letters of your words were distributed evenly, you could have A-M on one machine and N-Z on the other. Real data will probably not split that evenly at M/N, but you will certainly be able to find a boundary that splits it more or less 50/50 like this.
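To find where such a 50/50 boundary lies for a concrete word list, one could count words per first letter and pick the letter at which the running total comes closest to half. The sketch below is only an illustration under stated assumptions: `BalancedSplit` and `splitLetter` are hypothetical names, only words starting with a-z are counted, and the number of words is used as a stand-in for the real memory footprint of each subtree.

```java
import java.util.List;

// Hypothetical helper: picks the first-letter boundary for a two-machine split.
class BalancedSplit {

    // Returns the last letter of the first shard, so that machine 0 holds
    // 'a'..result and machine 1 holds everything after it.
    static char splitLetter(List<String> words) {
        int[] counts = new int[26];
        int total = 0;
        for (String w : words) {
            if (w.isEmpty()) {
                continue;
            }
            char c = Character.toLowerCase(w.charAt(0));
            if (c < 'a' || c > 'z') {
                continue; // assumption: ignore words outside a-z
            }
            counts[c - 'a']++;
            total++;
        }
        int running = 0;
        int best = 0;
        int bestDiff = Integer.MAX_VALUE;
        // Stop at 'y' so the second machine always keeps at least 'z'.
        for (int i = 0; i < 25; i++) {
            running += counts[i];
            int diff = Math.abs(2 * running - total); // distance from a 50/50 split
            if (diff < bestDiff) {
                bestDiff = diff;
                best = i;
            }
        }
        return (char) ('a' + best);
    }

    public static void main(String[] args) {
        List<String> words = List.of("apple", "avocado", "banana", "sun", "sea", "sky", "tree", "zebra");
        System.out.println("first shard ends at: " + splitLetter(words));
    }
}
```

When one letter dominates, as with the three S words in this small list, no single-letter boundary gives an even split, which is the problem with S discussed further down.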
If you now want to add more and more machines, I'd keep a main node which works as a load balancer and distributes requests to the child nodes, each of which takes care of only a few letters. For instance you could have nodes
This assumes you have roughly as much data for the letters A-F as you have for the letter S. (There might actually be a language where this is at least close to the optimal distribution.)
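A load balancer of this kind can be sketched as a sorted map from the first letter of each range to the node responsible for it. The class name `LetterRangeRouter`, its methods, and the node names are all hypothetical, and the sketch covers routing only: actually moving the affected subtrees to a new machine after a split is left out.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical root load balancer: routes a word by its first letter.
class LetterRangeRouter {
    // Key = first letter of a range, value = name of the node serving it.
    private final TreeMap<Character, String> ranges = new TreeMap<>();

    LetterRangeRouter(String initialNode) {
        ranges.put('a', initialNode); // one node serves a-z to begin with
    }

    // Starts a new range at the given letter; the range before it shrinks.
    // Data migration to the new node is NOT handled in this sketch.
    void split(char firstLetterOfNewRange, String newNode) {
        ranges.put(Character.toLowerCase(firstLetterOfNewRange), newNode);
    }

    String nodeFor(String word) {
        char first = Character.toLowerCase(word.charAt(0));
        // Greatest range start that is <= the word's first letter.
        Map.Entry<Character, String> entry = ranges.floorEntry(first);
        if (entry == null) {
            throw new IllegalArgumentException("unsupported first character: " + first);
        }
        return entry.getValue();
    }

    public static void main(String[] args) {
        LetterRangeRouter router = new LetterRangeRouter("node-1");
        router.split('g', "node-2");
        router.split('s', "node-3");
        router.split('t', "node-4");
        System.out.println(router.nodeFor("sun"));
        router.split('e', "node-5"); // splitting an overloaded range is one more boundary
        System.out.println(router.nodeFor("egg"));
    }
}
```

With single-letter keys the smallest possible range is one letter, so this design has no way to spread a single heavy letter over more than one node.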
Now if you get too much data in A-F, you can just split it into A-D and E-F, for instance; nothing really changes there. The problem comes if you get too much data in S. Then you'd have 3 possibilities:
You modify the root load balancer so that it can specify more complex boundaries between nodes, so that you'd now have the nodes
Here number 1 is probably the easiest and cleanest solution, but it might leave some hardware unused. If you can use different resources for different nodes, option 1 with a small load balancer for the letter S would probably be the way to go. Option 2 is a dirty mix, and option 3 might be the nicest way to go, but it makes the load balancer potentially complicated and error-prone.
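One possible reading of option 3, "more complex boundaries", is to let the root load balancer key its ranges on string prefixes instead of single letters, so that a heavy letter such as S can be cut in the middle. The sketch below assumes exactly that; `PrefixRangeRouter`, the boundary values, and the node names are invented for illustration and are not taken from the answer.

```java
import java.util.TreeMap;

// Hypothetical load balancer whose range boundaries are prefixes, not letters.
class PrefixRangeRouter {
    // Key = lowest string of a range (inclusive), value = node serving it.
    private final TreeMap<String, String> boundaries = new TreeMap<>();

    PrefixRangeRouter(String firstNode) {
        boundaries.put("", firstNode); // the empty string sorts before every word
    }

    void addBoundary(String lowestKey, String node) {
        boundaries.put(lowestKey.toLowerCase(), node);
    }

    String nodeFor(String word) {
        // Greatest boundary <= word in lexicographic order; never null,
        // because "" is always present and compares lowest.
        return boundaries.floorEntry(word.toLowerCase()).getValue();
    }

    public static void main(String[] args) {
        PrefixRangeRouter router = new PrefixRangeRouter("node-a");
        router.addBoundary("s", "node-s1");  // s .. sl*
        router.addBoundary("sm", "node-s2"); // sm .. sz*
        router.addBoundary("t", "node-t");
        System.out.println(router.nodeFor("salt"));
        System.out.println(router.nodeFor("snow"));
    }
}
```

The lookup itself stays a single sorted-map query; the complexity the answer warns about lies more in choosing good boundaries and in moving data between nodes whenever a boundary changes.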
Hope these ideas help you.