I've spent the week processing some gnarly text files -- some in the hundred million row range.
I've used Python to open, parse, transform, and output these files. I've been running the jobs in parallel, often 6-8 at a time, on a massive 8-processor, 16-core EC2 instance with SSD storage.
And I would say that the output is bad on 0.001% of writes, like:
Expected output: |1107|2013-01-01 00:00:00|PS|Johnson|etc.
Actual output: |11072013-01-01 00:00:00|PS|Johnson|etc.
or |1107|2013-01-01 :00:00|PS|Johnson
Almost always, the problem is not GIGO, but rather that Python has failed to write a separator or part of a date field. So I assume I'm overloading the SSD with these jobs, or rather that the machine is failing to throttle Python to account for write contention on the drive.
My question is this: how do I get the fastest processing from this box without inducing these kinds of "write" errors?
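Each job boils down to a loop like this (heavily simplified; the paths, field layout, and transform step are placeholders, not my real code):

def transform(fields):
    # Placeholder for the real per-row transformation.
    return fields

def process_file(in_path, out_path):
    # Each job opens its own input and output file; nothing is shared between jobs.
    with open(in_path, "r", encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            fields = line.rstrip("\n").split("|")
            dst.write("|".join(transform(fields)) + "\n")

# e.g. process_file("input_part_003.psv", "output_part_003.psv")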
Are you using the multiprocessing module (separate processes), or just threads, for the parallel processing?
I doubt very much that the SSD is the problem. Or Python. But maybe the csv module has a race condition and isn't thread-safe?
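If the jobs are threads sharing one file handle or one csv writer, interleaved rows like your samples are exactly what you'd expect. A common fix is to keep a single writer process and have the workers hand it finished rows over a queue; here's a minimal sketch using multiprocessing (the output file name and row data are placeholders):

import csv
import multiprocessing as mp

def worker(rows, queue):
    # Parse/transform in the worker; push finished rows to the single writer.
    for row in rows:
        queue.put(row)

def writer(queue, out_path):
    # Only this process touches the output file, so rows can't interleave.
    with open(out_path, "w", newline="") as f:
        w = csv.writer(f, delimiter="|")
        while True:
            row = queue.get()
            if row is None:          # sentinel: all workers are done
                break
            w.writerow(row)

if __name__ == "__main__":
    chunks = [[["1107", "2013-01-01 00:00:00", "PS", "Johnson"]]]  # placeholder data
    q = mp.Queue()
    wp = mp.Process(target=writer, args=(q, "out.psv"))
    wp.start()
    workers = [mp.Process(target=worker, args=(chunk, q)) for chunk in chunks]
    for p in workers:
        p.start()
    for p in workers:
        p.join()
    q.put(None)                      # tell the writer to stop
    wp.join()

With separate processes each writing to its own file, on the other hand, the OS serializes the writes per file and this kind of corruption shouldn't be possible at all.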
Also, check your code. And the inputs. Are the "bad" writes consistent? Can you reproduce them? You mention GIGO, but don't really rule it out ("Almost always, ...").
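To check reproducibility, a quick scan for rows whose separator count differs from the rest of the file will catch the missing-delimiter cases (though not the mangled-date one); a sketch:

from collections import Counter

def find_bad_rows(path, sep="|"):
    # Flag line numbers whose separator count differs from the file's most common count.
    counts = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            counts.append(line.count(sep))
    if not counts:
        return []
    expected = Counter(counts).most_common(1)[0][0]
    return [i + 1 for i, c in enumerate(counts) if c != expected]

# Example usage (path is a placeholder):
# print(find_bad_rows("output_part_003.psv"))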