
Sequential or parallel: what is the proper way to read multiple files in Python?

I'm wondering about the trade-offs between reading files in sequence vs. in parallel.

Let's say I have a million megabyte-sized files that I would like to process, but not enough memory to hold all of them at once. To process these sequentially, I can do:

import os

results = [do_something(os.path.join(files, f)) for f in os.listdir(files)]

Or I can do it in parallel:

import multiprocessing

paths = [os.path.join(files, f) for f in os.listdir(files)]
p = multiprocessing.Pool()
try:
    results = p.map(do_something, paths)
    p.close()
    p.join()
except KeyboardInterrupt:
    p.terminate()

In general I've been cautioned against performing parallel I/O because random disk reading is quite slow. But in this case is parallel the way to go? Or perhaps some mixed strategy?

Also, I notice that the parallel version preserves the structure of the directory; that is to say, the output is in the correct order. Does that mean that it's actually doing it sequentially, or is python just being kind? Edit: Blender cleared this second question up. Thanks, Blender!

Thanks for the help.

asked Oct 04 '22 by rhombidodecahedron

2 Answers

Parallel processing will be hurt by disk I/O if you have multiple disk accesses per file. Also, if do_something does little enough work per file, the gains may not cover the overhead of shipping data to the workers and the context switching that occurs in the process pool. But since you say that do_something is significantly expensive, it is probably worth processing in parallel.

Also, you can minimize the disk I/O by reading each file into memory in one go rather than reading it line by line. Of course this requires more memory per worker, but it will probably reduce the processing time significantly.
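
For example, a minimal sketch of both ideas might look like this. Note that process_file, do_something, and the 'data' directory are just placeholder names, and do_something here only counts bytes to stand in for your real, expensive processing:

import os
import multiprocessing

def do_something(data):
    # Stand-in for the asker's expensive per-file processing; it just
    # counts bytes so that the sketch actually runs.
    return len(data)

def process_file(path):
    # One sequential read pulls the whole file into memory, so each worker
    # touches the disk once per file instead of once per line.
    with open(path, 'rb') as f:
        data = f.read()
    return do_something(data)

if __name__ == '__main__':
    dir_path = 'data'  # hypothetical directory holding the megabyte-sized files
    paths = [os.path.join(dir_path, name) for name in os.listdir(dir_path)]

    # Each worker holds at most one file in memory at a time, so peak memory
    # stays near (pool size) x (largest file), not the size of the whole set.
    with multiprocessing.Pool() as pool:
        results = pool.map(process_file, paths)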

answered Oct 07 '22 by Gus E


It partly depends on the type of storage medium the files are on. A conventional hard drive will slow nearly to a crawl because of seek activity. An SSD, on the other hand, is much less susceptible to random reads (though it isn't entirely unaffected).

Even if you have an SSD, there is probably a point of diminishing returns as you add workers. The default Pool size (cpu_count()) is probably fine, but you may find that the sweet spot is much higher than cpu_count(). There are too many factors to make any prediction, so you should benchmark a range of pool sizes, for example as sketched below.
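
A rough benchmark along those lines might look like this. Note that process_file and the 'data' directory are hypothetical placeholders, and the pool sizes tried are arbitrary:

import os
import time
import multiprocessing

def process_file(path):
    # Hypothetical stand-in for the real per-file work: read the whole
    # file and return its size.
    with open(path, 'rb') as f:
        return len(f.read())

if __name__ == '__main__':
    dir_path = 'data'  # hypothetical directory of files to benchmark against
    paths = [os.path.join(dir_path, name) for name in os.listdir(dir_path)]

    # Run the same workload at several pool sizes and keep whichever is fastest.
    for workers in (1, 2, 4, multiprocessing.cpu_count(), multiprocessing.cpu_count() * 2):
        start = time.time()
        with multiprocessing.Pool(processes=workers) as pool:
            pool.map(process_file, paths)
        print(f'{workers} workers: {time.time() - start:.2f}s')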

answered Oct 07 '22 by Marcelo Cantos