Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Store/retrieve a data structure

I have implemented a suffix tree in Python to make full-text-searchs, and it's working really well. But there's a problem: the indexed text can be very big, so we won't be able to have the whole structure in RAM.

enter image description here

IMAGE: Suffix tree for the word BANANAS (in my scenario, imagine a tree 100000 times bigger).

So, researching a little bit about it I found the pickle module, a great Python module for "loading" and "dumping" objects from/into files, and guess what? It works wonderfully with my data structure.

So, making the long story shorter: What would be the best strategy to store and retrieve this structure on/from disk? I mean, a solution could be to store each node in a file and load it from disk whenever is needed, but this isn't the best think to do (too many disk accesses).


Footnote: Although I have tagged this question as python, the programming language isn't the important part of the question, the disk storing/retrieving strategy is really the main point.

like image 488
juliomalegria Avatar asked Nov 28 '11 18:11

juliomalegria


People also ask

Which data structure is used for retrieval?

In computer science, a search data structure is any data structure that allows the efficient retrieval of specific items from a set of items, such as a specific record from a database.

Which are an easy way to store and retrieve data?

Hashing is one of the fastest ways to store and retrieve data. Hashing is used to index and retrieve items in a database because it is faster to find the item using the shorter hashed key than to find it using the original value. It is also used in many encryption algorithms.

What is storage structure in data structure?

The representation of particular data structure in the main memory of a computer is called as storage structure. For Examples: Array, Stack, Queue, Tree, Graph, etc.


1 Answers

An effective way to manage a structure like this is to use a memory-mapped file. In the file, instead of storing references for the node pointers, you store offsets into the file. You can still use pickle to serialise the node data to a stream suitable for storing on disk, but you will want to avoid storing references since the pickle module will want to embed your entire tree all at once (as you've seen).

Using the mmap module, you can map the file into address space and read it just like a huge sequence of bytes. The OS takes care of actually reading from the file and managing file buffers and all the details.

You might store the first node at the start of the file, and have offsets that point to the next node(s). Reading the next node is just a matter of reading from the correct offset in the file.

The advantage of memory-mapped files is that they aren't loaded into memory all at once, but only read from disk when needed. I've done this (on a 64-bit OS) with a 30 GB file on a machine with only 4 GB of RAM installed, and it worked fine.

like image 78
Greg Hewgill Avatar answered Oct 26 '22 23:10

Greg Hewgill