I have a sparse array that seems to be too large to handle effectively in memory (2000x2500000, float). I can form it into a sparse lil_array (scipy), but if I try to output a column- or row-compressed sparse array (A.tocsc(), A.tocsr()), my machine runs out of memory. There's also a serious mismatch between the size of the data in a text file (4.4G) and the pickled lil array (12G); it would be nice to have a disk format that more closely approximates the raw data size.
I will probably be handling even larger arrays in the future.
Question: What's the best way to handle large on-disk arrays so that I can use the regular numpy functions transparently? For instance, sums along rows and columns, vector products, max, min, slicing, etc.?
Is PyTables the way to go? Is there a good (fast) SQL-NumPy middleware layer? A secret on-disk array built into numpy?
In the past, with (slightly smaller) arrays, I've always just pickle-cached the results of long calculations to disk. This works when the arrays end up being < 4G or so, but it is no longer tenable.
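A rough sketch of the current workflow described above (the file names and the "row col value" text format are just placeholders):

```python
import pickle

import numpy as np
from scipy import sparse

# Build the sparse matrix entry by entry from the text file
# (the path and the "row col value" layout are placeholders).
A = sparse.lil_array((2000, 2500000), dtype=np.float64)
with open("data.txt") as fh:
    for line in fh:
        i, j, value = line.split()
        A[int(i), int(j)] = float(value)

# This is the step that runs out of memory:
# A = A.tocsr()

# Pickle-caching the lil array gives the ~12G file mentioned above.
with open("matrix.pkl", "wb") as fh:
    pickle.dump(A, fh, protocol=pickle.HIGHEST_PROTOCOL)
```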
Sometimes, we need to deal with NumPy arrays that are too big to fit in the system memory. A common solution is to use memory mapping and implement out-of-core computations. The array is stored in a file on the hard drive, and we create a memory-mapped object backed by this file that can be used like a regular NumPy array.
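A minimal sketch with np.memmap (the file name and shape below are illustrative):

```python
import numpy as np

# Create a disk-backed array; only the pages that are actually touched
# get pulled into RAM.
shape = (2000, 2500000)
a = np.memmap("big_array.dat", dtype=np.float64, mode="w+", shape=shape)

# Use it (mostly) like a regular NumPy array.
a[0, :1000] = np.random.rand(1000)
row_sum = a[0].sum()

# Push pending writes out to disk when done.
a.flush()
```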
Just to be clear: there's no "good" way to extend a NumPy array, as NumPy arrays are not expandable. Once the array is defined, the space it occupies in memory, determined by the number of its elements and the size of each element, is fixed and cannot be changed.
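A quick illustration: functions like np.append don't grow an array in place, they allocate a brand new array and copy the data over:

```python
import numpy as np

a = np.zeros(3)
b = np.append(a, [1.0, 2.0])    # allocates a new 5-element array and copies a into it

print(a.shape, b.shape)         # (3,) (5,)
print(np.shares_memory(a, b))   # False: b is a fresh allocation, not an extension of a
```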
There is no general maximum array size hard-coded into numpy beyond what the indexing type allows: the limit comes from the np.intp datatype, which on 32-bit builds may only be 32 bits.
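You can check the limit on your own build directly:

```python
import numpy as np

# np.intp is the integer type NumPy uses for indexing; its maximum value
# bounds how many elements an array can address on this build.
print(np.intp)                # e.g. <class 'numpy.int64'> on a 64-bit build
print(np.iinfo(np.intp).max)  # 2**63 - 1 on 64-bit, 2**31 - 1 on 32-bit
```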
mode : {'r+', 'r', 'w+', 'c'}, optional
The file is opened in this mode:
- 'r'  : open existing file for reading only.
- 'r+' : open existing file for reading and writing.
- 'w+' : create or overwrite existing file for reading and writing.
- 'c'  : copy-on-write; assignments affect data in memory, but changes are not saved to disk.
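A short sketch of the modes in action (the file name is illustrative):

```python
import numpy as np

# 'w+' creates (or overwrites) the file and allows writing.
m = np.memmap("scratch.dat", dtype=np.float32, mode="w+", shape=(100, 100))
m[:] = 1.0
m.flush()
del m

# 'r' reopens the same file read-only; any assignment would raise an error.
r = np.memmap("scratch.dat", dtype=np.float32, mode="r", shape=(100, 100))

# 'c' is copy-on-write: assignments change the in-memory copy only,
# the file on disk is left untouched.
c = np.memmap("scratch.dat", dtype=np.float32, mode="c", shape=(100, 100))
c[0, 0] = 42.0   # not written back to scratch.dat
```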
I often use memory-mapped numpy arrays to process multi-gigabyte numerical matrices. I find them to work really well for my purposes. Obviously, if the size of the data exceeds the amount of RAM, one has to be careful about access patterns to avoid thrashing.
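One way to keep the access pattern friendly is to work through the array in contiguous blocks of rows, e.g. computing column sums over a memmap like the one sketched above (the block size is just a guess to tune against the available RAM):

```python
import numpy as np

shape = (2000, 2500000)
a = np.memmap("big_array.dat", dtype=np.float64, mode="r", shape=shape)

# Accumulate column sums one contiguous block of rows at a time, so each
# pass reads a sequential chunk of the file instead of striding across it.
block = 100                      # rows per chunk; tune to the available RAM
col_sums = np.zeros(shape[1])
for start in range(0, shape[0], block):
    col_sums += a[start:start + block].sum(axis=0)
```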