
Optimizing my large data code with little RAM

I have a 120 GB file saved (in binary via pickle) that contains about 50,000 (600x600) 2d numpy arrays. I need to stack all of these arrays using a median. The easiest way to do this would be to simply read in the whole file as a list of arrays and use np.median(arrays, axis=0). However, I don't have much RAM to work with, so this is not a good option.

So I tried to stack them pixel by pixel: I focus on one pixel position (i, j) at a time, read in each array one by one, and append the value at that position to a list. Once the values for that position across all arrays are collected, I take np.median of the list and save the result in another list, which at the end holds the median of every pixel position. Finally I reshape that list to 600x600 and I'm done. The code for this is below.

import pickle
import time
import numpy as np

filename = 'images.dat' #contains my 50,000 2D numpy arrays

def stack_by_pixel(i, j):
    pixels_at_position = []
    with open(filename, 'rb') as f:
        while True:
            try:
                # Gather pixels at a given position
                array = pickle.load(f)
                pixels_at_position.append(array[i][j])
            except EOFError:
                break
    # Stacking at position (median)
    stacked_at_position = np.median(np.array(pixels_at_position))
    return stacked_at_position

# Form whole stacked image
stacked = []
for i in range(600):
    for j in range(600):
        t1 = time.time()
        stacked.append(stack_by_pixel(i, j))
        t2 = time.time()
        print('Done with element %d, %d: %f seconds' % (i, j, (t2-t1)))

stacked_image = np.reshape(stacked, (600,600))

After seeing some of the time printouts, I realize that this is wildly inefficient. Each position (i, j) takes about 150 seconds to complete, which is not surprising since it reads all ~50,000 arrays one by one. And given that there are 360,000 (i, j) positions in my large arrays, this is projected to take 22 months to finish! Obviously that isn't feasible. But I'm sort of at a loss, because there's not enough RAM available to read in the whole file. Alternatively, I could read through the file once and collect the values for every pixel position as each array is loaded (a separate list for each position), but wouldn't keeping 360,000 lists (each about 50,000 elements long) in Python use a lot of RAM as well?
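A rough back-of-the-envelope estimate of that second idea (the byte sizes are assumptions: ~32 bytes per value stored as Python float objects in lists, 4 bytes per value in a packed NumPy float32 array):

n_arrays = 50000
n_pixels = 600 * 600

# Plain Python lists: roughly 24 bytes per float object plus an 8-byte list slot.
python_lists_gb = n_arrays * n_pixels * 32 / 1e9   # ~576 GB

# A single preallocated NumPy float32 array is much tighter, but still far too big.
numpy_float32_gb = n_arrays * n_pixels * 4 / 1e9   # ~72 GB

print(python_lists_gb, numpy_float32_gb)

Either way, it doesn't fit in memory.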

Any suggestions are welcome for how I could make this run significantly faster without using a lot of RAM. Thanks!

asked Aug 30 '18 by curious_cosmo


2 Answers

This is a perfect use case for numpy's memory mapped arrays. Memory mapped arrays allow you to treat a .npy file on disk as though it were loaded in memory as a numpy array, without actually loading it. It's as simple as

arr = np.load('filename', mmap_mode='r')

For the most part you can treat this as any other array. Array elements are only loaded into memory as required. Unfortunately, some quick experimentation suggests that median doesn't handle memory-mapped arrays well*; it still seems to load a substantial portion of the data into memory at once, so median(arr, 0) may not work.

However, you can still loop over each index and calculate the median without running into memory issues.

[[np.median([arr[k][i][j] for k in range(50000)]) for j in range(600)] for i in range(600)]

where 50,000 reflects the total number of arrays.

Without the overhead of unpickling every array just to extract a single pixel, the run time should be much quicker (by roughly a factor of 360,000, since the file is now effectively read once instead of once per pixel).
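If looping over every pixel individually is still too slow, a row-at-a-time variant may help (an untested sketch, assuming the (50000, 600, 600) float32 memmap described below; each slice pulls in only ~50,000 × 600 values, about 120 MB, at a time):

import numpy as np

arr = np.load('filename', mmap_mode='r')    # same memory-mapped array as above
# arr[:, i, :] gathers row i from every image, so only that slice is read per step
rows = [np.median(arr[:, i, :], axis=0) for i in range(600)]
stacked_image = np.vstack(rows)             # final (600, 600) median image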

Of course, that leaves the problem of creating a .npy file containing all of the data. A file can be created as follows,

arr = np.lib.format.open_memmap(
    'filename',              # File to store in
    mode='w+',               # Specify to create the file and write to it
    dtype=np.float32,        # Change this to your data's type
    shape=(50000, 600, 600)  # Shape of resulting array
)

Then, load the data as before and store it into the array (which just writes it to disk behind the scenes).

idx = 0
with open(filename, 'rb') as f:
    while True:
        try:
            arr[idx] = pickle.load(f)
            idx += 1
        except EOFError:
            break

Give it a couple hours to run, then head back to the start of this answer to see how to load it and take the median. Can't be any simpler**.

*I just tested it on a 7GB file, taking the median over 1,500 samples of 5,000,000 elements each, and memory usage was around 7GB, suggesting the entire array may have been loaded into memory. It doesn't hurt to try this way first though. If anyone else has experience with median on memory-mapped arrays, feel free to comment.

** If you believe strangers on the internet.

answered Oct 16 '22 by user2699


Note: I use Python 2.x; porting this to 3.x shouldn't be difficult.


My idea is simple - disk space is plentiful, so let's do some preprocessing and turn that big pickle file into something that is easier to process in small chunks.

Preparation

In order to test this, I wrote a small script that generates a pickle file resembling yours. I assumed your input images are grayscale with 8-bit depth, and generated 10,000 random images using numpy.random.randint.

This script will act as a benchmark that we can compare the preprocessing and processing stages against.

import numpy as np
import pickle
import time

IMAGE_WIDTH = 600
IMAGE_HEIGHT = 600
FILE_COUNT = 10000

t1 = time.time()

with open('data/raw_data.pickle', 'wb') as f:
    for i in range(FILE_COUNT):
        data = np.random.randint(256, size=IMAGE_WIDTH*IMAGE_HEIGHT, dtype=np.uint8)
        data = data.reshape(IMAGE_HEIGHT, IMAGE_WIDTH)
        pickle.dump(data, f)
        print i,

t2 = time.time()
print '\nDone in %0.3f seconds' % (t2 - t1)

In a test run, this script completed in 372 seconds, generating a ~10 GB file.

Preprocessing

Let's split the input images on a row-by-row basis -- we will have 600 files, where file N contains row N from each input image. We can store the row data in binary using numpy.ndarray.tofile (and later load those files using numpy.fromfile).

import numpy as np
import pickle
import time

# Increase open file limit (Windows-specific)
# See https://stackoverflow.com/questions/6774724/why-python-has-limit-for-count-of-file-handles
import win32file
win32file._setmaxstdio(1024)

IMAGE_WIDTH = 600
IMAGE_HEIGHT = 600
FILE_COUNT = 10000

t1 = time.time()

outfiles = []
for i in range(IMAGE_HEIGHT):
    outfilename = 'data/row_%03d.dat' % i
    outfiles.append(open(outfilename, 'wb'))


with open('data/raw_data.pickle', 'rb') as f:
    for i in range(FILE_COUNT):
        data = pickle.load(f)
        for j in range(IMAGE_HEIGHT):
            data[j].tofile(outfiles[j])
        print i,

for i in range(IMAGE_HEIGHT):
    outfiles[i].close()

t2 = time.time()
print '\nDone in %0.3f seconds' % (t2 - t1)

In a test run, this script completed in 134 seconds, generating 600 files of 6 million bytes each. It used ~30MB of RAM.

Processing

Simple: just load each row file using numpy.fromfile, reshape it to (FILE_COUNT, IMAGE_WIDTH), use numpy.median to get per-column medians (reducing it to a single row), and accumulate those rows in a list.

Finally, use numpy.vstack to reassemble a median image.

import numpy as np
import time

IMAGE_WIDTH = 600
IMAGE_HEIGHT = 600

t1 = time.time()

result_rows = []

for i in range(IMAGE_HEIGHT):
    outfilename = 'data/row_%03d.dat' % i
    data = np.fromfile(outfilename, dtype=np.uint8).reshape(-1, IMAGE_WIDTH)
    median_row = np.median(data, axis=0)
    result_rows.append(median_row)
    print i,

result = np.vstack(result_rows)
print result

t2 = time.time()
print '\nDone in %0.3f seconds' % (t2 - t1)

In a test run, this script completed in 74 seconds. You could even parallelize it quite easily, but it doesn't seem to be worth it. The script used ~40MB of RAM.
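For completeness, the parallel version might look something like this (an untested sketch using multiprocessing.Pool, written to run under both Python 2 and 3; it assumes the same data/row_%03d.dat files as above):

import numpy as np
from multiprocessing import Pool

IMAGE_WIDTH = 600
IMAGE_HEIGHT = 600

def median_of_row(i):
    # Load all samples of row i and reduce them to a single median row.
    data = np.fromfile('data/row_%03d.dat' % i, dtype=np.uint8).reshape(-1, IMAGE_WIDTH)
    return np.median(data, axis=0)

if __name__ == '__main__':
    pool = Pool(4)   # number of worker processes; tune to your CPU
    result = np.vstack(pool.map(median_of_row, range(IMAGE_HEIGHT)))
    pool.close()
    pool.join()
    print(result.shape)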


Since both of those scripts scale linearly with the number of images, the run times should scale linearly as well: for 50,000 images that works out to about 11 minutes of preprocessing and 6 minutes for the final processing. This was measured on an i7-4930K @ 3.4GHz, using 32-bit Python on purpose.

answered Oct 16 '22 by Dan Mašek