I'm wondering about the trade-offs between reading files in sequence vs. in parallel.
Let's say I have a million files of about a megabyte each that I would like to process, but not enough memory to hold all of them at once. To process these sequentially, I can do:
import os

# 'files' is the path to the directory containing the input files
results = [do_something(os.path.join(files, f)) for f in os.listdir(files)]
Or I can do it in parallel:
import multiprocessing

paths = [os.path.join(files, f) for f in os.listdir(files)]
p = multiprocessing.Pool()  # defaults to one worker per CPU core
try:
    results = p.map(do_something, paths)  # blocks until all results are in
    p.close()
    p.join()
except KeyboardInterrupt:
    p.terminate()
In general, I've been cautioned against parallel I/O because random disk reads are quite slow. But in this case, is parallel the way to go? Or perhaps some mixed strategy?
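By a mixed strategy I'm imagining something like a small, fixed-size pool combined with a lazy imap, so only a few files are being read at any moment and the results never all sit in memory at once. Roughly (handle is a hypothetical consumer of each result):

p = multiprocessing.Pool(processes=4)  # cap the number of concurrent reads
try:
    # imap yields results lazily and in input order, instead of building one big list
    for result in p.imap(do_something, paths, chunksize=100):
        handle(result)  # hypothetical: write out / aggregate each result
    p.close()
    p.join()
except KeyboardInterrupt:
    p.terminate()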
Also, I notice that the parallel version preserves the structure of the directory; that is to say, the output comes back in the same order as the input. Does that mean that it's actually doing it sequentially, or is Python just being kind? Edit: Blender cleared this second question up. Thanks, Blender!
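For anyone else who hits this: Pool.map always returns results in the order of its input, even though the workers complete out of order; imap_unordered is the variant that yields results as they finish. A quick way to see the difference (slow_square is just a toy function):

import multiprocessing
import time

def slow_square(n):
    time.sleep(0.1 * (5 - n))  # earlier inputs take longer to finish
    return n * n

if __name__ == '__main__':
    with multiprocessing.Pool(4) as p:
        print(p.map(slow_square, range(5)))                   # always [0, 1, 4, 9, 16]
        print(list(p.imap_unordered(slow_square, range(5))))  # completion order, varies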
Thanks for the help.
Parallel processing will be hurt by disk I/O if you have multiple disk accesses per file. However, if do_something does little enough work, the overhead of the process pool (spawning workers, pickling arguments and results, context switching) might outweigh the gains. Since you say that do_something is significantly expensive, it is probably worth processing in parallel.
Also, you can minimize disk I/O by reading each file into memory in one go rather than line by line. Of course this requires more memory per worker, but with megabyte-sized files it will probably decrease the processing time significantly.
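As a sketch of what that might look like inside do_something (the parse step is a placeholder for whatever processing you do per file):

def do_something(path):
    # one large sequential read per file instead of many small line reads
    with open(path, 'rb') as f:
        data = f.read()
    return parse(data)  # hypothetical per-file processing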
It partly depends on the type of storage medium they're on. A conventional hard drive will slow to a crawl due to seek activity. An SSD, on the other hand, is much less hurt by random reads (though it isn't entirely unaffected).
Even if you have an SSD, you might find that there's a point of diminishing returns, though the default Pool size is probably fine, and you may even find that the sweet spot is much higher than cpu_count(). There are too many factors to make any predictions, so you should benchmark a range of pool sizes.
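A minimal way to benchmark that, assuming do_something and paths from the question and timing the same representative sample of files at each size:

import multiprocessing
import time

for size in (1, 2, 4, 8, 16, 32):
    start = time.time()
    with multiprocessing.Pool(size) as p:
        p.map(do_something, paths[:1000])  # same sample of files for each pool size
    print('%2d workers: %.2fs' % (size, time.time() - start))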