I'm using something like this:
find folder/ | xargs -n1 -P10 ./logger.py > collab
Inside logger.py
I am processing the files and outputting reformatted lines, so collab should look like:
{'filename' : 'file1', 'size' : 1000}
{'filename' : 'file1', 'size' : 1000}
{'filename' : 'file1', 'size' : 1000}
{'filename' : 'file1', 'size' : 1000}
Instead, sometimes the lines are getting jumbled:
{'filename' : 'file1', 'size' : 1000}
{'file
{'filename' : 'file1', 'size' : 1000}
name' : 'file1', 'size' : 1000}
{'filename' : 'file1', 'size' : 1000}
How can I prevent / correct this?
In general, there are issues that make it very hard to guarantee this won't happen, without delving into multi-process locking. However, you can usually reduce the problem a lot.
The most common cause of this is I/O buffering, within Python or libc. For example, it may buffer 16k of output and then write the whole block at once. You can reduce that by flushing stdout after each write, but that's awkward. In theory, passing -u
to Python should disable stdout buffering, but that didn't work when I tried it. See Sebastjan's answer to Disable output buffering for a more generic solution (though there's probably a way to disable output buffering more directly).
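A minimal sketch of what "flush after each line" looks like inside a script such as logger.py (the record shown is just the example from the question, not real output):

```python
import sys

record = {'filename': 'file1', 'size': 1000}

# Flush after every line so each complete line leaves the process
# immediately, instead of sitting in a multi-kilobyte stdio buffer
# until it fills and gets written mid-line alongside other processes.
print(record, flush=True)

# On Python 3.7+ you can instead switch stdout to line buffering once,
# so every print() is flushed at its trailing newline automatically.
if hasattr(sys.stdout, 'reconfigure'):
    sys.stdout.reconfigure(line_buffering=True)
```

Flushing at line boundaries doesn't make the write atomic, but it means the process only ever hands complete lines to the kernel, which is what makes the next point (write sizes) matter.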
A second problem is that the underlying writes aren't always atomic. In particular, writes to pipes are only guaranteed atomic up to a certain size (PIPE_BUF, which POSIX requires to be at least 512 bytes; it's 4096 on Linux); above that, it isn't guaranteed. That only strictly applies to pipes (not regular files), but the same general principle holds: smaller writes are more likely to happen atomically. See http://www.opengroup.org/onlinepubs/000095399/functions/write.html.
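To take advantage of that, you can serialize each record into one string and hand it to the kernel in a single write() call. A sketch, demonstrated here against a pipe created with os.pipe() so it's self-contained (in logger.py the fd would just be stdout's):

```python
import os

def emit(fd, record):
    # Build the entire line first, then issue ONE write() syscall.
    # A single write of <= PIPE_BUF bytes to a pipe is atomic, so
    # complete short lines from parallel writers won't interleave.
    data = (repr(record) + '\n').encode()
    assert len(data) <= 512, "line may exceed PIPE_BUF; atomicity not guaranteed"
    os.write(fd, data)

# Demo: write one record through a pipe and read it back.
r, w = os.pipe()
emit(w, {'filename': 'file1', 'size': 1000})
os.close(w)
out = os.read(r, 1024).decode()
os.close(r)
```

Note that print() doesn't give you this guarantee, since the stdio layer may split your line across multiple underlying writes; going through os.write yourself is what keeps it to one syscall.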
The complicated, and technically correct, solution would be to implement a mutex for writing, but that's suboptimal, I think.
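For completeness, here is roughly what that mutex approach looks like using an advisory file lock (fcntl.flock, POSIX-only); the temp file stands in for the shared output file, and the record is just the question's example:

```python
import fcntl
import tempfile

def locked_write(f, record):
    # Take an exclusive advisory lock so only one process at a time
    # can append; every cooperating writer must do the same.
    fcntl.flock(f, fcntl.LOCK_EX)   # blocks until the lock is free
    try:
        f.write(repr(record) + '\n')
        f.flush()                   # push the line out before releasing
    finally:
        fcntl.flock(f, fcntl.LOCK_UN)

# Demo with a throwaway file; in logger.py this would be collab.
with tempfile.NamedTemporaryFile('w+') as f:
    locked_write(f, {'filename': 'file1', 'size': 1000})
    f.seek(0)
    content = f.read()
```

This serializes the writers, which is why it's correct, and also why it costs you some of the parallelism you ran xargs -P10 for in the first place.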
And it's not fun anyway. How about piping the output from xargs (that way you get solid chunks of output, instead of a stream of output that gets broken up) and then combining those chunks somehow?
The problem is that the output from xargs is mixed together. GNU Parallel is designed to solve exactly that problem: by default it guarantees output is not mixed together. So you can simply do this:
find folder/ | parallel ./logger.py > collab
This will run one logger.py per CPU core. If you want 10 at a time:
find folder/ | parallel -P10 ./logger.py > collab
Watch the intro video to learn more about GNU Parallel: http://www.youtube.com/watch?v=OpaiGYxkSuQ