 

Statistics on huge numpy (HDF5) arrays

I have fairly large 4D arrays [20x20x40x15000] that I save to disk as HDF5 files using h5py. Now the problem is that I want to calculate an average of the entire array i.e. using:

numpy.average(HDF5_file)

I get a MemoryError. It seems that numpy tries to load the HDF5 file into memory to perform the average?
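
For reference, a minimal sketch of the setup (the file and dataset names are made up):

    import h5py
    import numpy as np

    # create a 4D dataset of the stated shape on disk
    with h5py.File('data.h5', 'w') as f:
        f.create_dataset('data', shape=(20, 20, 40, 15000), dtype='float64')

    # re-open it and average; numpy.average materializes the whole dataset
    # in memory first, which is what triggers the MemoryError
    with h5py.File('data.h5', 'r') as f:
        mean = np.average(f['data'])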

Does anyone have an elegant and efficient solution to this problem?

asked Feb 20 '23 by Onlyjus

1 Answer

Reducing 240,000,000 values will take a few lines of code to work efficiently. Numpy works by loading all of the data into memory, so, as you discovered, you cannot load it naively. You will have to divide the problem into chunks and use a map/reduce approach (a sketch follows the list below):

  • select a chunk size (according to your memory constraints)
  • divide the data into chunks of this size (either by creating several files, or by loading only one chunk at a time)
  • for each chunk, compute the average and unload the data
  • merge the per-chunk averages into your final result
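
A minimal sketch of that approach, assuming the file contains a single dataset named 'data' and chunking along the last axis (both are assumptions):

    import h5py
    import numpy as np

    def chunked_mean(path, dset_name='data', chunk=500):
        """Stream slabs along the last axis instead of loading everything."""
        total = 0.0
        count = 0
        with h5py.File(path, 'r') as f:
            dset = f[dset_name]
            n = dset.shape[-1]
            for start in range(0, n, chunk):
                # h5py reads only the requested slab from disk
                slab = dset[..., start:start + chunk]
                total += slab.sum(dtype=np.float64)
                count += slab.size
        return total / count

    # mean = chunked_mean('data.h5')

Accumulating a running total and count also handles an uneven final chunk, so no extra weighting is needed when merging.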

You can use numpy.frombuffer (or numpy.fromfile) with the count and offset arguments to load only part of your data.
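
Note that this applies to raw binary dumps rather than the HDF5 container itself, which has its own internal layout; a hedged sketch, assuming the values were written with ndarray.tofile:

    import numpy as np

    chunk_values = 1_000_000   # values per chunk, chosen to fit in memory
    i = 3                      # index of the chunk to load

    # count limits how many values are read; offset is given in bytes,
    # so skip i whole chunks of float64 (8 bytes per value)
    part = np.fromfile('data.bin', dtype=np.float64,
                       count=chunk_values,
                       offset=i * chunk_values * 8)

For the HDF5 file itself, slicing the h5py dataset (as in the sketch above) gives the same kind of partial read.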

Edit:

You can use the Python profiler to check what the relative costs are.
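
For instance, with cProfile from the standard library (process_file is a placeholder for whatever function does the loading and averaging):

    import cProfile
    import pstats

    # dump the profile to disk, then print the ten most expensive calls
    cProfile.run('process_file("data.h5")', 'stats.prof')
    pstats.Stats('stats.prof').sort_stats('cumulative').print_stats(10)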

If the main cost is the processing, you can try to parallelize it with a process pool from the multiprocessing library or with a parallel version of numpy, or use a GPGPU library such as PyOpenCL.
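
A sketch of the process-pool variant, reusing the chunked reads from above (the dataset name, chunk size and shape are assumptions; each worker opens the file itself so no handle is shared between processes):

    import h5py
    import numpy as np
    from multiprocessing import Pool

    def partial_sum(args):
        """Worker: open the file read-only, read one slab, return (sum, size)."""
        path, start, stop = args
        with h5py.File(path, 'r') as f:
            slab = f['data'][..., start:stop]
        return slab.sum(dtype=np.float64), slab.size

    if __name__ == '__main__':
        path, chunk, n = 'data.h5', 500, 15000
        tasks = [(path, s, min(s + chunk, n)) for s in range(0, n, chunk)]
        with Pool() as pool:
            results = pool.map(partial_sum, tasks)
        total = sum(s for s, _ in results)
        count = sum(c for _, c in results)
        print(total / count)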

If the processing time is comparable to the loading time, you can try to pipeline the two tasks with the multiprocessing library: create one process to load the data and feed it to the process doing the computation.
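
A sketch of such a pipeline: one loader process feeds slabs through a bounded queue to the main process, which does the reduction (the file name, dataset name and chunk size are made up):

    import h5py
    import numpy as np
    from multiprocessing import Process, Queue

    def loader(path, chunk, queue):
        """Producer: read slabs from disk and push them to the queue."""
        with h5py.File(path, 'r') as f:
            dset = f['data']
            n = dset.shape[-1]
            for start in range(0, n, chunk):
                queue.put(dset[..., start:start + chunk])
        queue.put(None)  # sentinel: no more data

    if __name__ == '__main__':
        q = Queue(maxsize=4)  # bound the queue so loading cannot run far ahead
        p = Process(target=loader, args=('data.h5', 500, q))
        p.start()
        total, count = 0.0, 0
        while True:
            slab = q.get()
            if slab is None:
                break
            total += slab.sum(dtype=np.float64)
            count += slab.size
        p.join()
        print(total / count)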

If the main cost is the loading time, you have a bigger problem. You can try to divide the task across different computers (using a grid library to manage data replication and task distribution).

answered Feb 22 '23 by Simon Bergot