
limits of python in parallel file processing

Tags:

python

I've spent the week processing some gnarly text files -- some in the hundred million row range.

I've used Python to open, parse, transform, and output these files. I've been running the jobs in parallel, often 6-8 at a time, on a massive 8-processor, 16-core EC2 instance with SSD storage.

And I would say that the output is bad on roughly 0.001% of writes, for example:

 Expected output:  |1107|2013-01-01 00:00:00|PS|Johnson|etc.

 Actual output:    |11072013-01-01 00:00:00|PS|Johnson|etc.
               or  |1107|2013-01-01 :00:00|PS|Johnson

Almost always, the problem is not GIGO, but rather that Python has failed to write a separator or part of a date field. So I assume that I'm overloading the SSD with these jobs, or rather that the machine is failing to throttle Python based on write contention for the drive.

My question is this: how do I get the fastest processing from this box without inducing these kinds of "write" errors?

Asked Jul 26 '13 by Todd Curry


1 Answer

Are you using the multiprocessing module (separate processes) or just using threads for the parallel processing?

I doubt very much that the SSD, or Python itself, is the problem. But maybe the csv module has a race condition and isn't thread-safe?
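If the parallelism is thread-based, with several threads sharing one open file or csv.writer, interleaved rows like the ones shown above are exactly what you'd expect. Here is a minimal sketch of one way to avoid that with the multiprocessing module: every worker parses its own input and pushes finished rows onto a queue, and a single dedicated writer process owns the output file. The parse_line function and the file names are hypothetical placeholders, not code from the question.

import csv
import multiprocessing as mp

def parse_line(line):
    # Hypothetical transform -- stands in for whatever parsing the jobs do.
    return line.rstrip("\n").split("|")

def worker(in_path, queue):
    # Each worker parses its own input file and pushes finished rows
    # onto the queue; it never touches the output file directly.
    with open(in_path) as f:
        for line in f:
            queue.put(parse_line(line))
    queue.put(None)  # sentinel: this worker is done

def writer(out_path, queue, n_workers):
    # Exactly one process owns the output file, so rows can never
    # interleave no matter how busy the disk is.
    done = 0
    with open(out_path, "w", newline="") as f:
        out = csv.writer(f, delimiter="|")
        while done < n_workers:
            row = queue.get()
            if row is None:
                done += 1
            else:
                out.writerow(row)

if __name__ == "__main__":
    inputs = ["part1.txt", "part2.txt", "part3.txt"]  # hypothetical file names
    queue = mp.Queue()
    writer_proc = mp.Process(target=writer, args=("combined.out", queue, len(inputs)))
    writer_proc.start()
    workers = [mp.Process(target=worker, args=(p, queue)) for p in inputs]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    writer_proc.join()

If each job already writes to its own separate output file (which separate processes started from the shell would normally do), this pattern isn't needed, and the cause is more likely in the parsing itself.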

Also, check your code and your inputs. Are the "bad" writes consistent? Can you reproduce them? You mention GIGO, but you don't really rule it out ("Almost always, ...").
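A quick way to test that is to scan the output for rows with the wrong field count and see whether the same rows go bad on a re-run. A rough sketch, assuming a '|' delimiter; the expected field count and the output file name are placeholders you'd adjust to your data:

# Sketch: flag output rows whose field count doesn't match expectations,
# so bad writes can be counted and compared across runs.
EXPECTED_FIELDS = 5  # placeholder -- set to the real column count
with open("combined.out") as f:
    for lineno, line in enumerate(f, 1):
        fields = line.rstrip("\n").split("|")
        if len(fields) != EXPECTED_FIELDS:
            print(f"line {lineno}: {len(fields)} fields: {line.rstrip()}")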

Answered Sep 18 '22 by Daren Thomas