
Joining concurrent Python Output

Tags: python, bash, xargs

I'm using something like this:

find folder/ | xargs -n1 -P10 ./logger.py > collab

Inside logger.py I am processing the files and outputting reformatted lines. So collab should look like:

{'filename' : 'file1', 'size' : 1000}
{'filename' : 'file1', 'size' : 1000}
{'filename' : 'file1', 'size' : 1000}
{'filename' : 'file1', 'size' : 1000}

Instead sometimes the lines are getting jumbled:

{'filename' : 'file1', 'size' : 1000}
{'file
{'filename' : 'file1', 'size' : 1000}
name' : 'file1', 'size' : 1000}
{'filename' : 'file1', 'size' : 1000}

How can I prevent / correct this?
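
For context, a minimal sketch of what logger.py might look like (hypothetical; the actual script isn't shown):

#!/usr/bin/env python
# Hypothetical sketch of logger.py: print one reformatted line
# per file passed on the command line.
import os
import sys

for path in sys.argv[1:]:
    record = {'filename': os.path.basename(path), 'size': os.path.getsize(path)}
    print(record)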

asked Feb 16 '11 by Josh K


3 Answers

In general, it's very hard to guarantee this won't happen without delving into multi-process locking, but you can usually reduce the problem a lot.

The most common cause of this is I/O buffering, within Python or libc. For example, it may be buffering 16k of output, and then writing the whole block at once. You can reduce that by flushing stdout after writing to it, but that's awkward. In theory, you should be able to pass -u to Python to disable stdout buffering, but that didn't work when I tried it. See Sebastjan's answer to Disable output buffering for a more generic solution (though there's probably a way to disable output buffering more directly).
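
If flushing is acceptable, here is a minimal sketch of the flush-after-each-line approach (assuming logger.py emits one line per record; the emit name is illustrative):

import sys

def emit(record):
    # Build the whole line first, then write and flush in one go, so the
    # line leaves Python's buffer immediately instead of sitting in a
    # larger block that gets written out later.
    sys.stdout.write(str(record) + '\n')
    sys.stdout.flush()

emit({'filename': 'file1', 'size': 1000})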

A second problem is that the underlying writes aren't always atomic. In particular, writes to pipes are only guaranteed to be atomic up to a certain size (PIPE_BUF, at least 512 bytes per POSIX); above that it's not guaranteed. That only strictly applies to pipes (not files), but the same general principle holds: smaller writes are more likely to happen atomically. See http://www.opengroup.org/onlinepubs/000095399/functions/write.html.
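
A sketch of leaning on that guarantee: emit each record as a single small os.write() call (512 is the portable minimum for PIPE_BUF; os.fpathconf can report the real limit for a given descriptor):

import os
import sys

PIPE_BUF = 512  # portable lower bound; Linux actually uses 4096

def emit(record):
    data = (str(record) + '\n').encode('utf-8')
    assert len(data) <= PIPE_BUF, 'line too long for a guaranteed-atomic write'
    # One os.write() per line, below PIPE_BUF, so the kernel won't
    # interleave it with writes from sibling processes on the same pipe.
    os.write(sys.stdout.fileno(), data)

emit({'filename': 'file1', 'size': 1000})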

answered by Glenn Maynard


The complicated, and technically correct, solution would be to implement a mutex for writing, but I think that's suboptimal.

And it's not fun anyway. How about piping the output from xargs (that way you get solid chunks of output, instead of a stream of output that gets broken up) and then combining those chunks somehow?
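
One sketch of that idea, assuming you can modify logger.py: have every process append to its own chunk file, then concatenate afterwards.

import os
import sys

# Hypothetical variant of logger.py: each process appends to its own
# chunk file (named by PID here), so no two writers share a file and
# lines can't interleave.
out = open('collab.%d' % os.getpid(), 'a')
for path in sys.argv[1:]:
    record = {'filename': os.path.basename(path), 'size': os.path.getsize(path)}
    out.write(str(record) + '\n')
out.close()

Once xargs finishes, cat collab.* > collab stitches the chunks back together (note that with -n1 every invocation is a fresh process, so expect one chunk file per input file).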

answered by Oren Mazor


The problem is that the output from the parallel jobs run by xargs gets mixed together. GNU Parallel is made for solving that problem: by default, it guarantees that output from different jobs is not mixed together. So you can simply do this:

find folder/ | parallel ./logger.py > collab

This will run one logger.py per CPU. If you want 10:

find folder/ | parallel -P10 ./logger.py > collab

Watch the intro video to learn more about GNU Parallel: http://www.youtube.com/watch?v=OpaiGYxkSuQ

answered by Ole Tange