How should I optimize this filesystem I/O bound program?

I have a Python program that does something like this (a rough sketch follows the list):

  1. Read a row from a csv file.
  2. Do some transformations on it.
  3. Break it up into the actual rows as they would be written to the database.
  4. Write those rows to individual csv files.
  5. Go back to step 1 unless the file has been totally read.
  6. Run SQL*Loader and load those files into the database.
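
For reference, that loop might look roughly like the sketch below; transform_row() and the per-table file handling are placeholders rather than the actual code:

```
import csv

# Rough sketch of the loop described above; transform_row() and the
# per-table output naming are placeholders for the real steps 2-4.
def split_source(source_path):
    writers = {}  # one open file and csv.writer per target table
    with open(source_path, newline='') as src:
        for row in csv.reader(src):                    # step 1
            for table, out_row in transform_row(row):  # steps 2-3 (placeholder)
                if table not in writers:
                    f = open(table + '.csv', 'w', newline='')
                    writers[table] = (f, csv.writer(f))
                writers[table][1].writerow(out_row)    # step 4
    for f, _ in writers.values():
        f.close()
    # step 6: SQL*Loader is then run against the generated files
```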

Step 6 isn't really taking much time at all. It seems to be step 4 that's taking up most of the time. For the most part, I'd like to optimize this for handling a set of records in the low millions running on a quad-core server with a RAID setup of some kind.

There are a few ideas that I have to solve this:

  1. Read the entire file from step 1 (or at least read it in very large chunks) and write the output files to disk as a whole or in very large chunks. The idea is that the hard disk would spend less time seeking back and forth between files. Would this do anything that buffering wouldn't?
  2. Parallelize steps 1, 2 & 3, and 4 into separate processes. This would make steps 1, 2, and 3 not have to wait on 4 to complete (a rough sketch of this follows the list).
  3. Break the load file up into separate chunks and process them in parallel. The rows don't need to be handled in any sequential order. This would likely need to be combined with step 2 somehow.
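
As a rough illustration of idea 2 (untested; transform_row() is again a placeholder for whatever steps 2 and 3 actually do), the read-and-transform work could feed a separate writer process through a queue so that step 4 runs concurrently:

```
import csv
import multiprocessing

# Sketch of idea 2: the reader/transformer feeds a separate writer process
# through a bounded queue; transform_row() is a placeholder.
def read_and_transform(source_path, queue):
    with open(source_path, newline='') as src:
        for row in csv.reader(src):
            for table, out_row in transform_row(row):
                queue.put((table, out_row))
    queue.put(None)                              # sentinel: no more work


def write_rows(queue):
    writers = {}                                 # one file and csv.writer per table
    while True:
        item = queue.get()
        if item is None:
            break
        table, out_row = item
        if table not in writers:
            f = open(table + '.csv', 'w', newline='')
            writers[table] = (f, csv.writer(f))
        writers[table][1].writerow(out_row)
    for f, _ in writers.values():
        f.close()


if __name__ == '__main__':
    q = multiprocessing.Queue(maxsize=10000)     # bounded so the reader can't run far ahead
    w = multiprocessing.Process(target=write_rows, args=(q,))
    w.start()
    read_and_transform('input.csv', q)
    w.join()
```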

Of course, the correct answer to this question is "do what you find to be the fastest by testing." However, I'm mainly trying to get an idea of where I should spend my time first. Does anyone with more experience in these matters have any advice?

asked by Jason Baker


2 Answers

Poor man's map-reduce:

Use split to break the file up into as many pieces as you have CPUs.

Use batch to run your muncher in parallel.

Use cat to concatenate the results.
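
In Python terms, the same idea could be sketched in-process with multiprocessing instead of split/batch/cat; munch() below is a placeholder for the existing per-chunk work, and the chunk files produced by split are assumed to already exist:

```
import glob
import multiprocessing

# In-process take on split/batch/cat: munch() stands in for the existing
# muncher, run once per chunk file produced by split.
def process_chunk(chunk_path):
    out_path = chunk_path + '.out'
    munch(chunk_path, out_path)                  # placeholder for the real per-chunk work
    return out_path


if __name__ == '__main__':
    chunks = sorted(glob.glob('input_chunk_*'))  # pieces made by split
    with multiprocessing.Pool() as pool:         # one worker per CPU by default
        outputs = pool.map(process_chunk, chunks)   # the "batch" step
    with open('combined.csv', 'w') as combined:  # the "cat" step
        for path in outputs:
            with open(path) as part:
                combined.write(part.read())
```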

answered by Jonathan Feinberg


Python already does IO buffering, and the OS should handle both prefetching the input file and delaying writes until it needs the RAM for something else or just gets uneasy about having dirty data in RAM for too long, unless you force the OS to write immediately, for example by closing the file after each write or by opening the file in O_SYNC mode.

If the OS isn't doing the right thing, you can try raising the buffer size (the third parameter to open()). For some guidance on appropriate values: on a 100 MB/s, 10 ms latency IO system, a 1 MB IO takes about 10 ms to transfer, so the latency overhead is roughly 50%, while a 10 MB IO takes about 100 ms, bringing the overhead down to about 9%. If it's still IO bound, you probably just need more bandwidth. Use your OS-specific tools to check what kind of bandwidth you are getting to/from the disks.
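
As a minimal sketch of that knob (the 10 MB figure is just the example size from above, and generated_rows stands in for the rows produced by steps 2 and 3):

```
# Ask for a roughly 10 MB write buffer instead of the default; the size is
# the example value discussed above, not a tuned recommendation.
BUF_SIZE = 10 * 1024 * 1024

with open('table_a.csv', 'w', BUF_SIZE) as out:
    for out_row in generated_rows:      # placeholder for the rows from steps 2-3
        out.write(','.join(out_row) + '\n')
```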

It is also useful to check whether step 4 is spending its time executing or waiting on IO. If it's executing, you'll need to spend more time working out which part is the culprit and optimize that, or split the work out into different processes.
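
One quick, if crude, way to check that from inside the program is to compare wall-clock time against CPU time around step 4; write_output_rows() below is a placeholder for the real write code:

```
import time

# Crude check: if wall-clock time is much larger than CPU time for this
# section, the process is mostly waiting on IO rather than executing.
wall_start = time.perf_counter()
cpu_start = time.process_time()

write_output_rows(rows)        # placeholder for the step 4 code

wall = time.perf_counter() - wall_start
cpu = time.process_time() - cpu_start
print(f'wall {wall:.2f}s, cpu {cpu:.2f}s ({cpu / wall:.0%} of the time executing)')
```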

answered by Ants Aasma


