 

Lazy Evaluation for iterating through NumPy arrays

I have a Python program that processes fairly large NumPy arrays (hundreds of megabytes), which are stored on disk in pickle files (one ~100 MB array per file). When I want to run a query on the data, I load the entire array via pickle and then perform the query (so from the perspective of the Python program the entire array is in memory, even if the OS is swapping it out). I did this mainly because I believed that vectorized operations on NumPy arrays would be substantially faster than looping over each item in Python.

I'm running this on a web server whose memory limits I quickly run up against. I run many different kinds of queries on the data, so writing "chunking" code that loads portions of the data from separate pickle files, processes them, and then moves on to the next chunk would likely add a lot of complexity. It would definitely be preferable to make this "chunking" transparent to any function that processes these large arrays.

It seems like the ideal solution would be something like a generator which periodically loaded a block of the data from the disk and then passed the array values out one by one. This would substantially reduce the amount of memory required by the program without requiring any extra work on the part of the individual query functions. Is it possible to do something like this?
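For concreteness, something like this sketch is what I'm imagining (the function and argument names are just placeholders):

import pickle

def iter_array_values(paths):
    # load one ~100 MB pickle at a time, then hand its values out lazily
    for path in paths:
        with open(path, "rb") as f:
            block = pickle.load(f)
        for value in block.flat:   # yields scalars one by one
            yield value
        del block                  # let this chunk be freed before the next load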

asked Dec 07 '22 by erich

2 Answers

PyTables is a package for managing hierarchical datasets. It is designed to solve this problem for you.
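For instance, here is a minimal sketch of the idea, assuming you convert the pickled arrays into an HDF5 file first (the file name data.h5 and node name my_array are placeholders):

import numpy as np
import tables

# one-time conversion: store the array in an HDF5 file
with tables.open_file("data.h5", mode="w") as f:
    f.create_array(f.root, "my_array", np.random.rand(1000, 10))

# querying: only the slices you index are actually read from disk
with tables.open_file("data.h5", mode="r") as f:
    node = f.root.my_array        # an on-disk reference; nothing is loaded yet
    chunk = node[100:200, :]      # reads just these rows into memory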

answered Dec 28 '22 by unutbu


NumPy's memory-mapped data structure (memmap) might be a good choice here.

It lets you access a NumPy array backed by a binary file on disk without loading the entire file into memory at once.

(Note: I believe, but am not certain, that NumPy's memmap is not the same as Python's mmap module; in particular, NumPy's is array-like, while Python's is file-like.)

The signature (slightly simplified) is:

A = NP.memmap(filename, dtype, mode, shape, order='C')

All of the arguments are straightforward (i.e., they have the same meaning as elsewhere in NumPy) except for 'order', which refers to the memory layout of the ndarray. The default is 'C', and the only other option is 'F', for Fortran; as elsewhere, these represent row-major and column-major order, respectively.

The key method is flush, which writes any changes you make to the array back to the file on disk. (There is no separate close in current NumPy: deleting the memmap object, or letting it go out of scope, flushes pending changes and releases the mapping.)

Example use:

import numpy as NP
from tempfile import mkdtemp
import os.path as PH

my_data = NP.random.randint(10, 100, 10000).reshape(1000, 10)
my_data = NP.array(my_data, dtype="float")

fname = PH.join(mkdtemp(), 'tempfile.dat')

mm_obj = NP.memmap(fname, dtype="float32", mode="w+", shape=1000, 10)

# now write the data to the memmap array:
mm_obj[:] = my_data[:]
mm_obj.flush()    # push the buffered changes to disk before re-opening

# reload the memmap:
mm_obj = NP.memmap(fname, dtype="float32", mode="r", shape=(1000, 10))

# verify that it's there!:
print(mm_obj[:20,:])
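
Because the memmap behaves like an ordinary ndarray, existing query functions can usually take it unchanged: the OS pages the file in (and evicts pages) as the computation touches them, rather than loading everything up front. A rough illustration, reusing the mm_obj from above:

# vectorized operations work directly on the memmap:
col_means = mm_obj.mean(axis=0)

# boolean indexing copies only the matching rows into memory:
big_rows = mm_obj[mm_obj[:, 0] > 50.0]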
answered Dec 28 '22 by doug