I have implemented a suffix tree in Python to make full-text-searchs, and it's working really well. But there's a problem: the indexed text can be very big, so we won't be able to have the whole structure in RAM.
IMAGE: Suffix tree for the word BANANAS
(in my scenario, imagine a tree 100000 times bigger).
So, researching a little bit about it I found the pickle
module, a great Python module for "loading" and "dumping" objects from/into files, and guess what? It works wonderfully with my data structure.
So, making the long story shorter: What would be the best strategy to store and retrieve this structure on/from disk? I mean, a solution could be to store each node in a file and load it from disk whenever is needed, but this isn't the best think to do (too many disk accesses).
Footnote: Although I have tagged this question as python, the programming language isn't the important part of the question, the disk storing/retrieving strategy is really the main point.
In computer science, a search data structure is any data structure that allows the efficient retrieval of specific items from a set of items, such as a specific record from a database.
Hashing is one of the fastest ways to store and retrieve data. Hashing is used to index and retrieve items in a database because it is faster to find the item using the shorter hashed key than to find it using the original value. It is also used in many encryption algorithms.
The representation of particular data structure in the main memory of a computer is called as storage structure. For Examples: Array, Stack, Queue, Tree, Graph, etc.
An effective way to manage a structure like this is to use a memory-mapped file. In the file, instead of storing references for the node pointers, you store offsets into the file. You can still use pickle
to serialise the node data to a stream suitable for storing on disk, but you will want to avoid storing references since the pickle
module will want to embed your entire tree all at once (as you've seen).
Using the mmap
module, you can map the file into address space and read it just like a huge sequence of bytes. The OS takes care of actually reading from the file and managing file buffers and all the details.
You might store the first node at the start of the file, and have offsets that point to the next node(s). Reading the next node is just a matter of reading from the correct offset in the file.
The advantage of memory-mapped files is that they aren't loaded into memory all at once, but only read from disk when needed. I've done this (on a 64-bit OS) with a 30 GB file on a machine with only 4 GB of RAM installed, and it worked fine.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With