Maybe I should start with a small introduction to my problem. I'm writing a Python program that will be used for post-processing of different physical simulations. Every simulation can create up to 100 GB of output. I deal with different kinds of information (like positions, fields, densities, ...) for different time steps. I would like to have access to all of this data at once, which isn't possible because I don't have enough memory on my system. Normally I read a file, do some operations, and clear the memory; then I read other data, do some operations, and clear the memory again.
Now my problem: if I do it this way, I spend a lot of time reading the same data more than once, which takes very long. I would like to read it only once and store it for easy access. Do you know a method to store a lot of data that is really fast to access, or that doesn't need a lot of space?
I just created a method which is around ten times faster than a normal open-read, but it uses cat (the Linux command). It's a really dirty method and I would like to kick it out of my script.
Is it possible to use databases to store this data and to get at it faster than with normal reading? (Sorry for this question, but I'm not a computer scientist and I don't know much about databases.)
EDIT:
My cat code looks something like this (only an example):
import os
import numpy as np

out = os.popen("cat " + base + "phs/phs01_00023_" + time).read().split()
# and if I want to have this data as an array, then I convert the strings
# to numbers and reshape (if I need it)
out = np.array(out, dtype=float)
out = out.reshape(shape)  # shape of the simulation grid
Normally I would use the numpy method numpy.loadtxt, which needs about the same time as normal reading:
f = open('filename')
f.read()
...
I think that loadtxt just uses these normal methods internally, with some additional code lines.
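For reference, a minimal loadtxt call looks like this (the filename is only a placeholder):

import numpy as np
out = np.loadtxt('filename')  # parses the whole text file into a float array in one go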
I know there are better ways to read the data, but everything I found so far was really slow. I will now try mmap and hopefully get better performance.
Using Python's built-in file object, it is possible to write string data to a disk file and read it back. Python's standard library provides modules to store and retrieve serialized data in formats such as JSON and XML, and Python's DB-API defines a standard way of interacting with relational databases.
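For example, a minimal sketch with the standard sqlite3 module (the file name, table layout, and values here are just made up for illustration):

import sqlite3

# open (or create) a database file on disk
conn = sqlite3.connect("simulation.db")
cur = conn.cursor()

# store one value per row, keyed by time step
cur.execute("CREATE TABLE IF NOT EXISTS density (step INTEGER, value REAL)")
cur.executemany("INSERT INTO density VALUES (?, ?)",
                [(0, 1.25), (0, 1.30), (1, 1.28)])
conn.commit()

# read back only the rows for one time step instead of the whole file
cur.execute("SELECT value FROM density WHERE step = ?", (1,))
print(cur.fetchall())
conn.close()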
Python also provides a large number of libraries for working with big data sets, and this kind of processing code is usually quicker to write in Python than in most other languages, which is why many developers choose it for such projects.
One option is double buffering: while a process function works on buffer1, the data that keeps arriving is appended to buffer2; once buffer1 is finished, you process buffer2 (with new data now going into buffer1) and repeat, so reading and processing can overlap.
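A rough, runnable sketch of that idea, using a reader thread and a bounded queue as the two buffers (read_chunks and process are placeholder functions, not from the original post):

import threading
import queue

def reader(read_chunks, q):
    # fill the queue while the consumer is busy processing earlier chunks
    for chunk in read_chunks():
        q.put(chunk)
    q.put(None)  # sentinel: no more data

def consume(q, process):
    while True:
        chunk = q.get()
        if chunk is None:
            break
        process(chunk)

def run(read_chunks, process):
    q = queue.Queue(maxsize=2)  # at most two chunks in flight, like two buffers
    t = threading.Thread(target=reader, args=(read_chunks, q))
    t.start()
    consume(q, process)
    t.join()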
If you're on a 64-bit operating system, you can use the mmap module to map that entire file into memory space. Then, reading random bits of the data can be done a lot more quickly since the OS is then responsible for managing your access patterns. Note that you don't actually need 100 GB of RAM for this to work, since the OS will manage it all in virtual memory.
I've done this with a 30 GB file (the Wikipedia XML article dump) on 64-bit FreeBSD 8 with very good results.
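For example, a rough sketch with the mmap module (the file name is only a placeholder):

import mmap

with open("phs/phs01_00023_0001", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)  # length 0 maps the whole file
    first_line = mm.readline()   # file-like reads work on the mapping
    chunk = mm[1000:2000]        # slicing only touches the pages it needs
    mm.close()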
I would try using HDF5. There are two commonly used Python interfaces, h5py and PyTables. While the latter seems to be more widespread, I prefer the former.
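A small sketch of what that looks like with h5py (the file name, dataset path, and shape are made up):

import numpy as np
import h5py

# write the parsed simulation data once into an HDF5 file
with h5py.File("simulation.h5", "w") as f:
    f.create_dataset("positions/step_00023", data=np.random.rand(1000, 3))

# later, read back only the slice that is needed; the rest stays on disk
with h5py.File("simulation.h5", "r") as f:
    first_ten = f["positions/step_00023"][:10]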