There are times when you have to perform many intermediate operations on one or more large NumPy arrays. This can quickly result in MemoryErrors. In my research so far, I have found that pickling (pickle, cPickle, PyTables, etc.) and gc.collect() are ways to mitigate this. I was wondering whether there are any other techniques experienced programmers use when dealing with large quantities of data (other than removing redundancies in your strategy/code, of course).
Also, if there's one thing I'm sure of, it's that nothing is free. With some of these techniques, what are the trade-offs (i.e., speed, robustness, etc.)?
Sometimes we need to deal with NumPy arrays that are too big to fit in system memory. A common solution is to use memory mapping and implement out-of-core computation: the array is stored in a file on the hard drive, and we create a memory-mapped object pointing to that file which can be used like a regular NumPy array.
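As a rough sketch of that approach (the file name, shape, dtype, and block size below are all arbitrary placeholders), a memory-mapped array can be filled and processed one block at a time without ever holding the whole thing in RAM:

```python
import numpy as np

# Illustrative shape/dtype; adjust to your data.
shape = (100_000, 1_000)
mm = np.memmap('big_array.dat', dtype=np.float32, mode='w+', shape=shape)

# Fill the array one block of rows at a time, so only a small
# slice is resident in memory at any moment.
rows_per_block = 10_000
for start in range(0, shape[0], rows_per_block):
    stop = start + rows_per_block
    mm[start:stop] = np.random.rand(stop - start, shape[1])

mm.flush()  # make sure the data is written out to disk
```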
I feel your pain... You sometimes end up storing several times the size of your array in values you will later discard. When processing one item in your array at a time, this is irrelevant, but can kill you when vectorizing.
I'll use an example from work for illustration purposes. I recently coded the algorithm described here using NumPy. It is a color map algorithm that takes an RGB image and converts it into a CMYK image. For every pixel, the high 4 bits of each RGB value index into a three-dimensional lookup table, and the low 4 bits are used to interpolate within that cell.
A couple of things you can do to handle this:
Maybe you cannot process a 1,000x1,000 array in a single pass. But if you can do it with a Python for loop iterating over 10 arrays of 100x1,000, it will still beat a Python iterator over 1,000,000 items by a very wide margin! It's going to be slower than a single vectorized pass, yes, but not by nearly as much.
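For instance, a minimal sketch of that block-wise idea (the array size, block size, and the placeholder computation are all illustrative):

```python
import numpy as np

a = np.random.rand(1_000, 1_000)
out = np.empty_like(a)

# Process the array in chunks of rows instead of all at once,
# so the temporaries created by the vectorized expression stay small.
rows_per_block = 100
for start in range(0, a.shape[0], rows_per_block):
    stop = start + rows_per_block
    out[start:stop] = np.sqrt(a[start:stop]) * 2.0  # placeholder computation
```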
This relates directly to my interpolation example above, and is harder to come across, although worth keeping an eye open for. Because I am interpolating on a three-dimensional cube with 4 bits in each dimension, there are only 16x16x16 possible outcomes, which can be stored in 16 arrays of 16x16x16 bytes. So I can precompute them, store them using 64 KB of memory, and look up the values one by one for the whole image, rather than redoing the same operations for every pixel at huge memory cost. This already pays off for images as small as 64x64 pixels, and basically allows processing images with six times as many pixels without having to subdivide the array.
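The same precompute-and-look-up pattern in a stripped-down, generic form (this is not the CMYK pipeline itself; expensive() here just stands in for whatever per-value computation would otherwise be repeated for every pixel):

```python
import numpy as np

def expensive(x):
    # Stand-in for a costly per-value computation.
    return np.sqrt(x.astype(np.float64)) * 17.0

image = np.random.randint(0, 256, size=(64, 64), dtype=np.uint8)

# A uint8 pixel can only take 256 distinct values, so compute the result
# once for each possible value and then index with the image itself.
lut = expensive(np.arange(256, dtype=np.uint8))
result = lut[image]  # fancy indexing: one table lookup per pixel
```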
Use your dtypes wisely. If your intermediate values can fit in a single uint8, don't use an array of int32s! This can turn into a nightmare of mysterious errors due to silent overflows, but if you are careful, it can provide a big saving of resources.
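A small sketch of both the saving and the overflow hazard (the array length is arbitrary):

```python
import numpy as np

n = 10_000_000
a32 = np.zeros(n, dtype=np.int32)   # ~40 MB
a8 = np.zeros(n, dtype=np.uint8)    # ~10 MB
print(a32.nbytes, a8.nbytes)

# The catch: uint8 arithmetic wraps around at 256 with no warning.
b = np.array([200, 250], dtype=np.uint8)
print(b + 100)   # [44 94] -- silently wrapped modulo 256
```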
First, most important trick: allocate a few big arrays, and use and recycle portions of them, instead of bringing into life and discarding/garbage-collecting lots of temporary arrays. It sounds a little old-fashioned, but with careful programming the speed-up can be impressive. (You get better control of alignment and data locality, so numeric code can be made more efficient.)
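Here is a rough sketch of that recycling idea, using the out= argument that NumPy ufuncs accept (sizes and operations are placeholders):

```python
import numpy as np

n = 5_000_000
a = np.random.rand(n)
b = np.random.rand(n)

# One scratch buffer, allocated once and reused for every intermediate,
# instead of letting each expression allocate a fresh temporary array.
scratch = np.empty(n)

np.multiply(a, b, out=scratch)   # scratch = a * b
np.sqrt(scratch, out=scratch)    # scratch = sqrt(a * b)
scratch += 1.0                   # in-place add, no new array
```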
Second: use numpy.memmap and hope that the OS's caching of disk accesses is efficient enough.
Third: as pointed out by @Jaime, work on block sub-matrices if the whole matrix is too big.
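Tying the second and third points together, a sketch that treats an existing on-disk array as a read-only memmap and reduces it block by block (the file name, shape, and dtype are assumptions carried over from the earlier sketch):

```python
import numpy as np

# Assumed to exist on disk, e.g. written by the earlier memmap example.
data = np.memmap('big_array.dat', dtype=np.float32, mode='r',
                 shape=(100_000, 1_000))

# Accumulate a column-wise sum without ever loading the full matrix.
col_sum = np.zeros(data.shape[1], dtype=np.float64)
for start in range(0, data.shape[0], 10_000):
    block = np.array(data[start:start + 10_000])  # copy one block into RAM
    col_sum += block.sum(axis=0)
```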
EDIT: Avoid unnecessary list comprehensions, as pointed out in this answer on SE.
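To illustrate that last point (the doubling operation is just a placeholder): a list comprehension materializes a full Python list of boxed values before NumPy ever sees the data, whereas staying with array operations, or using np.fromiter for a genuine iterator, avoids that extra copy:

```python
import numpy as np

x = np.arange(1_000_000)

# Builds a million-element Python list first, then converts it.
wasteful = np.array([v * 2 for v in x])

# Same result, no intermediate list, far lower peak memory.
frugal = x * 2

# If the data really does come from a Python iterator:
frugal2 = np.fromiter((v * 2 for v in range(1_000_000)), dtype=np.int64)
```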