I've spent the week processing some gnarly text files -- some in the hundred million row range.
I've used Python to open, parse, transform, and output these files. I've been running the jobs in parallel, often 6-8 at a time, on a massive 8-processor, 16-core EC2 instance with SSD storage.
And I would say that the output is bad on 0.001% of writes, like:
Expected output: |1107|2013-01-01 00:00:00|PS|Johnson|etc.
Actual output: |11072013-01-01 00:00:00|PS|Johnson|etc.
or |1107|2013-01-01 :00:00|PS|Johnson
Almost always, the problem is not GIGO, but rather that Python has failed to write a separator or part of a date field. So I assume I'm overloading the SSD with these jobs, or rather that the machine is failing to throttle Python to account for write contention on the drive.
My question is this: how do I get the fastest processing from this box without inducing these kinds of "write" errors?
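Each job boils down to a loop like this (heavily simplified; the paths, field layout, and transform step are placeholders, not my real code):

def transform(fields):
    # Placeholder for the real per-row transformation.
    return fields

def process_file(in_path, out_path):
    # Each job opens its own input and output file; nothing is shared between jobs.
    with open(in_path, "r", encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            fields = line.rstrip("\n").split("|")
            dst.write("|".join(transform(fields)) + "\n")

# e.g. process_file("input_part_003.psv", "output_part_003.psv")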
Are you using the multiprocessing module (separate processes), or just threads, for the parallel processing?
I doubt very much that the SSD is the problem. Or Python. But maybe the csv module has a race condition and isn't thread-safe?
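If the jobs are threads sharing one file handle or one csv writer, interleaved rows like your samples are exactly what you'd expect. A common fix is to keep a single writer process and have the workers hand it finished rows over a queue; here's a minimal sketch using multiprocessing (the output file name and row data are placeholders):

import csv
import multiprocessing as mp

def worker(rows, queue):
    # Parse/transform in the worker; push finished rows to the single writer.
    for row in rows:
        queue.put(row)

def writer(queue, out_path):
    # Only this process touches the output file, so rows can't interleave.
    with open(out_path, "w", newline="") as f:
        w = csv.writer(f, delimiter="|")
        while True:
            row = queue.get()
            if row is None:          # sentinel: all workers are done
                break
            w.writerow(row)

if __name__ == "__main__":
    chunks = [[["1107", "2013-01-01 00:00:00", "PS", "Johnson"]]]  # placeholder data
    q = mp.Queue()
    wp = mp.Process(target=writer, args=(q, "out.psv"))
    wp.start()
    workers = [mp.Process(target=worker, args=(chunk, q)) for chunk in chunks]
    for p in workers:
        p.start()
    for p in workers:
        p.join()
    q.put(None)                      # tell the writer to stop
    wp.join()

With separate processes each writing to its own file, on the other hand, the OS serializes the writes per file and this kind of corruption shouldn't be possible at all.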
Also, check your code. And the inputs. Are the "bad" writes consistent? Can you reproduce them? You mention GIGO, but don't really rule it out ("Almost always, ...").
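To check reproducibility, a quick scan for rows whose separator count differs from the rest of the file will catch the missing-delimiter cases (though not the mangled-date one); a sketch:

from collections import Counter

def find_bad_rows(path, sep="|"):
    # Flag line numbers whose separator count differs from the file's most common count.
    counts = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            counts.append(line.count(sep))
    if not counts:
        return []
    expected = Counter(counts).most_common(1)[0][0]
    return [i + 1 for i, c in enumerate(counts) if c != expected]

# Example usage (path is a placeholder):
# print(find_bad_rows("output_part_003.psv"))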