I am trying to speed up reading chunks out of an h5py dataset file (i.e., loading them into RAM). Right now I am trying to do this via the multiprocessing library:
import multiprocessing as mp

pool = mp.Pool(NUM_PROCESSES)
gen = pool.imap(loader, indices)
Where the loader function is something like this:
import h5py

def loader(indices):
    # open the file read-only inside the worker and return the requested slice
    with h5py.File("location", 'r') as dataset:
        return dataset["name"][indices]
This actually works sometimes, meaning the expected loading time is divided by the number of processes (i.e., the reads are parallelized). Most of the time it doesn't, however, and the loading time stays as high as when loading the data sequentially. Is there anything I can do to fix this? I know h5py supports parallel reads/writes through mpi4py, but I would like to know whether that is absolutely necessary when I only need reads.
Parallel reads are fine with h5py; there is no need for the MPI version. But why do you expect a speed-up here? Your job is almost entirely I/O bound, not CPU bound. Parallel processes are not going to help, because the bottleneck is your hard disk, not the CPU. It wouldn't surprise me if parallelization in this case even slowed down the whole reading operation. Other opinions?
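For what it's worth, one way to check whether the disk really is the limit is to time both variants directly. Below is a minimal sketch, reusing the hypothetical file path "location" and dataset name "name" from the question; the indices and worker count are made up for illustration:

import time
import multiprocessing as mp
import h5py

FILE_PATH = "location"  # hypothetical path from the question
DATASET = "name"        # hypothetical dataset name from the question

def loader(idx):
    # each worker opens the file read-only and reads one chunk
    with h5py.File(FILE_PATH, "r") as f:
        return f[DATASET][idx]

def read_sequential(indices):
    # single process, one read per index
    with h5py.File(FILE_PATH, "r") as f:
        return [f[DATASET][idx] for idx in indices]

def read_parallel(indices, num_processes=4):
    # one h5py handle per worker process, opened read-only
    with mp.Pool(num_processes) as pool:
        return pool.map(loader, indices)

if __name__ == "__main__":
    indices = list(range(0, 1000, 10))  # made-up chunk indices

    t0 = time.perf_counter()
    read_sequential(indices)
    print("sequential:", round(time.perf_counter() - t0, 3), "s")

    t0 = time.perf_counter()
    read_parallel(indices)
    print("parallel:  ", round(time.perf_counter() - t0, 3), "s")

If both timings come out roughly the same (or the parallel one is slower, due to process start-up and pickling the returned arrays), that points to the disk being the bottleneck; if the parallel version is consistently faster, the sequential path was probably limited by per-read overhead such as decompression rather than raw disk bandwidth.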