I have a Python program that processes fairly large NumPy arrays (in the hundreds of megabytes), which are stored on disk in pickle files (one ~100MB array per file). When I want to run a query on the data I load the entire array, via pickle, and then perform the query (so that from the perspective of the Python program the entire array is in memory, even if the OS is swapping it out). I did this mainly because I believed that being able to use vectorized operations on NumPy arrays would be substantially faster than using for loops through each item.
I'm running this on a web server which has memory limits that I quickly run up against. I have many different kinds of queries that I run on the data so writing "chunking" code which loads portions of the data from separate pickle files, processes them, and then proceeds to the next chunk would likely add a lot of complexity. It'd definitely be preferable to make this "chunking" transparent to any function that processes these large arrays.
It seems like the ideal solution would be something like a generator which periodically loaded a block of the data from the disk and then passed the array values out one by one. This would substantially reduce the amount of memory required by the program without requiring any extra work on the part of the individual query functions. Is it possible to do something like this?
PyTables is a package for managing hierarchical datasets, built on top of HDF5. It is designed for exactly this situation: working with arrays that are too large to hold in memory all at once.
NumPy's memory-mapped array type (numpy.memmap) might be a good choice here.
It lets you access an array stored in a binary file on disk without loading the entire file into memory at once; only the parts you index are actually read.
(Note: I believe, but am not certain, that NumPy's memmap object is not the same as Python's mmap. In particular, NumPy's is array-like, while Python's is file-like.)
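Since the arrays in the question are stored as pickles, they would first need to be rewritten in a format that can be mapped. One option, which is my assumption and not something from the question, is to save each array with np.save and reopen it with np.load(..., mmap_mode="r"). A minimal sketch, with a made-up file name and a small array standing in for the real data:

```python
import os
import tempfile

import numpy as np

# Hypothetical file name and array size, chosen only for illustration.
path = os.path.join(tempfile.mkdtemp(), "big_array.npy")
np.save(path, np.arange(100_000, dtype=np.float64).reshape(1000, 100))

# mmap_mode="r" returns a read-only numpy.memmap instead of reading the whole file.
arr = np.load(path, mmap_mode="r")

# Only the part of the file backing this slice has to be read from disk.
row_means = arr[:10].mean(axis=1)
print(type(arr).__name__, arr.shape, row_means[0])
```

The existing query functions can keep using vectorized operations on arr, because it behaves like an ordinary ndarray.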
The constructor signature is:
A = NP.memmap(filename, dtype=NP.uint8, mode='r+', offset=0, shape=None, order='C')
Most of the arguments are straightforward (i.e., they have the same meaning as elsewhere in NumPy). The 'mode' argument is the file mode ('r', 'r+', 'w+', or 'c' for copy-on-write), and 'order' refers to the memory layout of the ndarray. The default is 'C', and the only other option is 'F', for Fortran; as elsewhere, these two options represent row-major and column-major order, respectively.
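To make the 'order' argument concrete, here is a small sketch that creates two memmaps of the same shape and dtype with different layouts and prints their strides. The file names are made up for the example.

```python
import os
import tempfile

import numpy as np

tmp_dir = tempfile.mkdtemp()

# Same shape and dtype, two different on-disk layouts (file names are hypothetical).
c_map = np.memmap(os.path.join(tmp_dir, "c_order.dat"), dtype="float32",
                  mode="w+", shape=(4, 3), order="C")
f_map = np.memmap(os.path.join(tmp_dir, "f_order.dat"), dtype="float32",
                  mode="w+", shape=(4, 3), order="F")

print(c_map.strides)  # row-major: stepping along a row moves 4 bytes
print(f_map.strides)  # column-major: stepping down a column moves 4 bytes
```

Whichever order you write the file with, you need to pass the same order (and shape and dtype) when you reopen it, because a raw memmap file stores no metadata.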
The method you will use most is:
flush (which writes to disk any changes you make to the array).
Older NumPy releases also had a close method. As far as I know, current releases do not; instead, the mapping is released when the memmap object is deleted or garbage-collected, so flush first if you have written to it.
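A short sketch of the write, flush, and reopen cycle follows. It assumes a recent NumPy, where deleting the object is the way to release the mapping; the file name is made up.

```python
import os
import tempfile

import numpy as np

# Hypothetical file name, used only for this example.
fname = os.path.join(tempfile.mkdtemp(), "flush_demo.dat")

writer = np.memmap(fname, dtype="float32", mode="w+", shape=(100,))
writer[:] = 1.5
writer.flush()   # push the modified pages out to the file
del writer       # drop the reference so the mapping can be released

# Reopen read-only; shape and dtype must be given again.
reader = np.memmap(fname, dtype="float32", mode="r", shape=(100,))
total = float(reader.sum())
print(total)
```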
Example use:
import numpy as NP
from tempfile import mkdtemp
import os.path as PH
my_data = NP.random.randint(10, 100, 10000).reshape(1000, 10)
my_data = NP.array(my_data, dtype="float")
fname = PH.join(mkdtemp(), 'tempfile.dat')
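mm_obj = NP.memmap(fname, dtype="float32", mode="w+", shape=(1000, 10))
# now write the data to the memmap array (values are cast to float32):
mm_obj[:] = my_data[:]
# flush pending changes to disk before reopening the file:
mm_obj.flush()
# reload the memmap, this time read-only:
mm_obj = NP.memmap(fname, dtype="float32", mode="r", shape=(1000, 10))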
# verify that it's there!:
print(mm_obj[:20,:])
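To connect this back to the generator idea in the question: a memmap can be sliced in blocks, so a small generator can hand out chunks while the query code stays vectorized within each chunk. This is only a sketch; iter_chunks, the chunk size, and the threshold query are names and values I made up for illustration.

```python
import os
import tempfile

import numpy as np


def iter_chunks(array, chunk_rows):
    """Yield consecutive blocks of rows from an array-like object."""
    for start in range(0, array.shape[0], chunk_rows):
        yield array[start:start + chunk_rows]


# Hypothetical file with known contents, so the result is easy to check.
fname = os.path.join(tempfile.mkdtemp(), "chunks.dat")
data = np.memmap(fname, dtype="float32", mode="w+", shape=(1000, 10))
data[:] = np.arange(10000, dtype="float32").reshape(1000, 10)
data.flush()

# A query written against chunks: count values above a threshold.
count = 0
for block in iter_chunks(data, chunk_rows=128):
    count += int((block > 5000).sum())
print(count)
```

Yielding blocks of rows, not single values, keeps the per-chunk work in NumPy and avoids a Python-level loop over every element.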