Is it possible to do parallel reads on one h5py file using multiprocessing?

I am trying to speed up the process of reading chunks of an h5py dataset file into RAM. Right now I am trying to do this via the multiprocessing library.

import multiprocessing as mp

pool = mp.Pool(NUM_PROCESSES)
gen = pool.imap(loader, indices)

Where the loader function is something like this:

import h5py

def loader(indices):
    # Each worker opens the file itself; the handle is not shared across processes.
    with h5py.File("location", 'r') as dataset:
        return dataset["name"][indices]

This actually works sometimes (meaning the expected loading time is divided by the number of processes, i.e. the reads are parallelized). However, most of the time it doesn't, and the loading time stays as high as when loading the data sequentially. Is there anything I can do to fix this? I know h5py supports parallel reads/writes through mpi4py, but I would like to know whether that is strictly necessary when I only need reads.
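For completeness, here is a minimal self-contained version of my setup (the file name "data.h5", the dataset name "name", and the index batches are placeholders for my real data):

import multiprocessing as mp

import h5py
import numpy as np

NUM_PROCESSES = 4

def loader(indices):
    # Each worker opens the file independently; handles must not be
    # shared across process boundaries.
    with h5py.File("data.h5", "r") as f:
        return f["name"][indices]

if __name__ == "__main__":
    # Each element is a sorted array of row indices to read in one call.
    index_batches = [np.arange(i, i + 1000) for i in range(0, 10000, 1000)]
    with mp.Pool(NUM_PROCESSES) as pool:
        chunks = list(pool.imap(loader, index_batches))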

asked Mar 25 '15 by Baptist
1 Answer

Parallel reads are fine with h5py; no need for the MPI version. But why do you expect a speed-up here? Your job is almost entirely I/O bound, not CPU bound. Parallel processes are not going to help, because the bottleneck is your hard disk, not the CPU. It wouldn't surprise me if parallelization in this case even slowed down the whole reading operation. Other opinions?
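If you want to check where the time goes, a simple timing comparison will tell you whether the pool buys you anything. This is just a sketch: "data.h5" and "name" are placeholders, and loader is the same function as in the question.

import time
import multiprocessing as mp

import h5py
import numpy as np

def loader(indices):
    # Same as the loader in the question.
    with h5py.File("data.h5", "r") as f:
        return f["name"][indices]

if __name__ == "__main__":
    batches = [np.arange(i, i + 1000) for i in range(0, 10000, 1000)]

    # Sequential baseline.
    t0 = time.perf_counter()
    seq = [loader(b) for b in batches]
    print("sequential:", time.perf_counter() - t0)

    # Pooled version; if the disk is the bottleneck, this
    # will not be meaningfully faster.
    t0 = time.perf_counter()
    with mp.Pool(4) as pool:
        par = list(pool.imap(loader, batches))
    print("parallel:  ", time.perf_counter() - t0)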

answered Nov 12 '22 by weatherfrog