I have fairly large 4D arrays [20x20x40x15000] that I save to disk as HDF5 files using h5py. Now the problem is that I want to calculate an average of the entire array i.e. using:
numpy.average(HDF5_file)
I get a `MemoryError`. It seems that numpy tries to load the entire HDF5 dataset into memory to perform the average?
Does anyone have an elegant and efficient solution to this problem?
Reducing 240,000,000 values (20×20×40×15000) will take a few lines of code to work efficiently. NumPy operates on in-memory arrays, so passing the whole dataset to numpy.average pulls all of it into RAM, which is exactly what you observed. You will have to divide the problem into chunks and use a map/reduce approach:
You can use numpy.frombuffer with the count and offset arguments to load part of your data at a time.
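With h5py you don't even need to go through raw buffers: slicing an h5py dataset reads only the requested slab from disk. Here is a minimal sketch of the chunked map/reduce average, assuming the file is called data.h5 and the dataset inside it is named "data" (both names are placeholders for your own):

```python
import h5py
import numpy as np

with h5py.File("data.h5", "r") as f:
    dset = f["data"]                      # shape (20, 20, 40, 15000)
    total = 0.0
    count = 0
    # Slicing an h5py dataset only reads that slab from disk,
    # so memory use stays bounded by one chunk at a time.
    for i in range(dset.shape[0]):
        chunk = dset[i, ...]              # one (20, 40, 15000) block in memory
        total += chunk.sum(dtype=np.float64)
        count += chunk.size
    mean = total / count
    print(mean)
```

Accumulating the sum in float64 helps avoid precision loss when folding 240 million values.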
Edit:
You can use the Python profiler to check the relative costs of loading and processing.
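For instance, with the standard cProfile module (process_all_chunks here is a placeholder for your own load-and-average function):

```python
import cProfile

# Profile the whole run and sort by cumulative time to see
# whether I/O or computation dominates.
cProfile.run("process_all_chunks()", sort="cumulative")
```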
If the main cost is the processing, you can try to parallelize it with a process pool from the multiprocessing library, a parallel build of numpy, or a GPGPU library such as PyOpenCL.
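A rough sketch of the process-pool variant, again assuming the data.h5/"data" names from above; each worker opens the file itself so no h5py handles have to be passed between processes:

```python
import h5py
import numpy as np
from multiprocessing import Pool

def partial_sum(i):
    # Open the file inside the worker; h5py objects don't pickle well.
    with h5py.File("data.h5", "r") as f:
        chunk = f["data"][i, ...]
    return chunk.sum(dtype=np.float64), chunk.size

if __name__ == "__main__":
    with h5py.File("data.h5", "r") as f:
        n = f["data"].shape[0]
    with Pool() as pool:
        results = pool.map(partial_sum, range(n))
    total = sum(s for s, _ in results)
    count = sum(c for _, c in results)
    print(total / count)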
If the processing time is comparable to the loading time, you can try to pipeline the two tasks with the multiprocessing library: create one process that loads the data and feeds it to the process doing the computation.
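A possible pipeline sketch with multiprocessing: one loader process feeds slabs to the consumer through a bounded queue (file and dataset names are again assumptions):

```python
import h5py
import numpy as np
from multiprocessing import Process, Queue

def loader(queue):
    with h5py.File("data.h5", "r") as f:
        dset = f["data"]
        for i in range(dset.shape[0]):
            queue.put(dset[i, ...])   # blocks if the consumer falls behind
    queue.put(None)                   # sentinel: no more data

if __name__ == "__main__":
    q = Queue(maxsize=2)              # small buffer keeps memory bounded
    p = Process(target=loader, args=(q,))
    p.start()
    total, count = 0.0, 0
    while True:
        chunk = q.get()
        if chunk is None:
            break
        total += chunk.sum(dtype=np.float64)
        count += chunk.size
    p.join()
    print(total / count)
```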
If the main cost is the loading time, you have a bigger problem: you can try to distribute the task across several machines (using a grid library to manage data replication and task distribution).