There are times when you have to perform many intermediate operations on one or more large NumPy arrays. This can quickly result in MemoryErrors. In my research so far, I have found that pickling (pickle, cPickle, PyTables, etc.) and gc.collect() are ways to mitigate this. I was wondering whether there are any other techniques experienced programmers use when dealing with large quantities of data (other than removing redundancies in your strategy/code, of course).
Also, if there's one thing I'm sure of, it's that nothing is free. With some of these techniques, what are the trade-offs (i.e., speed, robustness, etc.)?
Sometimes we need to deal with NumPy arrays that are too big to fit in system memory. A common solution is to use memory mapping and implement out-of-core computation: the array is stored in a file on the hard drive, and we create a memory-mapped object pointing to that file which can be used like a regular NumPy array.
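As a rough sketch of that approach (the file name, shape, dtype, and block size below are all arbitrary placeholders), a memory-mapped array can be filled and processed one block at a time without ever holding the whole thing in RAM:

```python
import numpy as np

# Illustrative shape/dtype; adjust to your data.
shape = (100_000, 1_000)
mm = np.memmap('big_array.dat', dtype=np.float32, mode='w+', shape=shape)

# Fill the array one block of rows at a time, so only a small
# slice is resident in memory at any moment.
rows_per_block = 10_000
for start in range(0, shape[0], rows_per_block):
    stop = start + rows_per_block
    mm[start:stop] = np.random.rand(stop - start, shape[1])

mm.flush()  # make sure the data is written out to disk
```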
I feel your pain... You sometimes end up storing several times the size of your array in values you will later discard. When processing one item in your array at a time, this is irrelevant, but can kill you when vectorizing.
I'll use an example from work for illustration purposes. I recently coded the algorithm described here using NumPy. It is a color map algorithm that takes an RGB image and converts it into a CMYK image. For every pixel, the high 4 bits of each RGB value index into a three-dimensional lookup table, and the low 4 bits are used to interpolate within that cell.
A couple of things you can do to handle this:
Maybe you cannot process a 1,000x1,000 array in a single pass. But if you can do it with a Python for loop iterating over 10 arrays of 100x1,000, it will still beat a Python iterator over 1,000,000 items by a very wide margin! It's going to be slower than a single vectorized pass, yes, but not by nearly as much.
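For instance, a minimal sketch of that block-wise idea (the array size, block size, and the placeholder computation are all illustrative):

```python
import numpy as np

a = np.random.rand(1_000, 1_000)
out = np.empty_like(a)

# Process the array in chunks of rows instead of all at once,
# so the temporaries created by the vectorized expression stay small.
rows_per_block = 100
for start in range(0, a.shape[0], rows_per_block):
    stop = start + rows_per_block
    out[start:stop] = np.sqrt(a[start:stop]) * 2.0  # placeholder computation
```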
This relates directly to my interpolation example above, and is harder to come across, although worth keeping an eye open for. Because I am interpolating on a three-dimensional cube with 4 bits in each dimension, there are only 16x16x16 possible outcomes, which can be stored in 16 arrays of 16x16x16 bytes. So I can precompute them, store them using 64 KB of memory, and look up the values one by one for the whole image, rather than redoing the same operations for every pixel at huge memory cost. This already pays off for images as small as 64x64 pixels, and basically allows processing images with six times as many pixels without having to subdivide the array.
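The same precompute-and-look-up pattern in a stripped-down, generic form (this is not the CMYK pipeline itself; expensive() here just stands in for whatever per-value computation would otherwise be repeated for every pixel):

```python
import numpy as np

def expensive(x):
    # Stand-in for a costly per-value computation.
    return np.sqrt(x.astype(np.float64)) * 17.0

image = np.random.randint(0, 256, size=(64, 64), dtype=np.uint8)

# A uint8 pixel can only take 256 distinct values, so compute the result
# once for each possible value and then index with the image itself.
lut = expensive(np.arange(256, dtype=np.uint8))
result = lut[image]  # fancy indexing: one table lookup per pixel
```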
Use your dtypes wisely. If your intermediate values can fit in a single uint8, don't use an array of int32s! This can turn into a nightmare of mysterious errors due to silent overflows, but if you are careful, it can provide a big saving of resources.
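A small sketch of both the saving and the overflow hazard (the array length is arbitrary):

```python
import numpy as np

n = 10_000_000
a32 = np.zeros(n, dtype=np.int32)   # ~40 MB
a8 = np.zeros(n, dtype=np.uint8)    # ~10 MB
print(a32.nbytes, a8.nbytes)

# The catch: uint8 arithmetic wraps around at 256 with no warning.
b = np.array([200, 250], dtype=np.uint8)
print(b + 100)   # [44 94] -- silently wrapped modulo 256
```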
First, most important trick: allocate a few big arrays, and use and recycle portions of them, instead of bringing into life and discarding/garbage-collecting lots of temporary arrays. It sounds a little old-fashioned, but with careful programming the speed-up can be impressive. (You get better control of alignment and data locality, so numeric code can be made more efficient.)
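Here is a rough sketch of that recycling idea, using the out= argument that NumPy ufuncs accept (sizes and operations are placeholders):

```python
import numpy as np

n = 5_000_000
a = np.random.rand(n)
b = np.random.rand(n)

# One scratch buffer, allocated once and reused for every intermediate,
# instead of letting each expression allocate a fresh temporary array.
scratch = np.empty(n)

np.multiply(a, b, out=scratch)   # scratch = a * b
np.sqrt(scratch, out=scratch)    # scratch = sqrt(a * b)
scratch += 1.0                   # in-place add, no new array
```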
Second: use numpy.memmap and hope that the OS's caching of disk accesses is efficient enough.
Third: as pointed out by @Jaime, work on block sub-matrices if the whole matrix is too big.
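Tying the second and third points together, a sketch that treats an existing on-disk array as a read-only memmap and reduces it block by block (the file name, shape, and dtype are assumptions carried over from the earlier sketch):

```python
import numpy as np

# Assumed to exist on disk, e.g. written by the earlier memmap example.
data = np.memmap('big_array.dat', dtype=np.float32, mode='r',
                 shape=(100_000, 1_000))

# Accumulate a column-wise sum without ever loading the full matrix.
col_sum = np.zeros(data.shape[1], dtype=np.float64)
for start in range(0, data.shape[0], 10_000):
    block = np.array(data[start:start + 10_000])  # copy one block into RAM
    col_sum += block.sum(axis=0)
```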
EDIT: Avoid unnecessary list comprehensions, as pointed out in this answer on SE.
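To illustrate that last point (the doubling operation is just a placeholder): a list comprehension materializes a full Python list of boxed values before NumPy ever sees the data, whereas staying with array operations, or using np.fromiter for a genuine iterator, avoids that extra copy:

```python
import numpy as np

x = np.arange(1_000_000)

# Builds a million-element Python list first, then converts it.
wasteful = np.array([v * 2 for v in x])

# Same result, no intermediate list, far lower peak memory.
frugal = x * 2

# If the data really does come from a Python iterator:
frugal2 = np.fromiter((v * 2 for v in range(1_000_000)), dtype=np.int64)
```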