Store/retrieve a data structure

Tags:

I have implemented a suffix tree in Python to make full-text-searchs, and it's working really well. But there's a problem: the indexed text can be very big, so we won't be able to have the whole structure in RAM.

enter image description here

IMAGE: Suffix tree for the word BANANAS (in my scenario, imagine a tree 100000 times bigger).

So, researching a little bit about it I found the pickle module, a great Python module for "loading" and "dumping" objects from/into files, and guess what? It works wonderfully with my data structure.

So, making the long story shorter: What would be the best strategy to store and retrieve this structure on/from disk? I mean, a solution could be to store each node in a file and load it from disk whenever is needed, but this isn't the best think to do (too many disk accesses).

Footnote: Although I have tagged this question as python, the programming language isn't the important part of the question, the disk storing/retrieving strategy is really the main point.

488

asked Nov 28 '11 18:11

juliomalegria

1 Answers

An effective way to manage a structure like this is to use a memory-mapped file. In the file, instead of storing references for the node pointers, you store offsets into the file. You can still use pickle to serialise the node data to a stream suitable for storing on disk, but you will want to avoid storing references since the pickle module will want to embed your entire tree all at once (as you've seen).

Using the mmap module, you can map the file into address space and read it just like a huge sequence of bytes. The OS takes care of actually reading from the file and managing file buffers and all the details.

You might store the first node at the start of the file, and have offsets that point to the next node(s). Reading the next node is just a matter of reading from the correct offset in the file.

The advantage of memory-mapped files is that they aren't loaded into memory all at once, but only read from disk when needed. I've done this (on a 64-bit OS) with a 30 GB file on a machine with only 4 GB of RAM installed, and it worked fine.

answered Oct 26 '22 23:10

Greg Hewgill

Related questions
                            
                                How can I update only certain fields in a Django model form?
                            
                                invoking yield for a generator in another function
                            
                                Generating Python soaplib stubs from WSDL
                            
                                Does Python support something like literal objects?
                            
                                installing libxml2 on python 2.7 windows
                            
                                How do you make a choices field in django with an editable "other" option?
                            
                                Celery - minimize memory consumption
                            
                                __package__ is None when importing a Python module
                            
                                C++ name mangling by hand
                            
                                Python utilizing multiple processors
                            
                                Programmatically add URL Patterns in Django?
                            
                                Python project and package directories layout
                            
                                Can I use Fabric to perform interactive shell commands?
                            
                                easy_install with various versions of python installed, mac osx
                            
                                Limitations of PyTuple_SetItem
                            
                                Can a Python list, set or dictionary be implemented invisibly using a database?
                            
                                Request and basic profiling information for Flask
                            
                                Testing REST API with database backend
                            
                                Better way to access super class method in multiple inheritance
                            
                                Equivalent of named tuple in NumPy?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Store/retrieve a data structure

Tags:

python

data-structures

ram

disk-io

juliomalegria

People also ask

1 Answers

Greg Hewgill

Recent Activity

Donate For Us