
Joining concurrent Python Output

Tags: python, bash, xargs

I'm using something like this:

find folder/ | xargs -n1 -P10 ./logger.py > collab

Inside logger.py I am processing the files and outputting reformatted lines. So collab should look like:

{'filename' : 'file1', 'size' : 1000}
{'filename' : 'file1', 'size' : 1000}
{'filename' : 'file1', 'size' : 1000}
{'filename' : 'file1', 'size' : 1000}

Instead sometimes the lines are getting jumbled:

{'filename' : 'file1', 'size' : 1000}
{'file
{'filename' : 'file1', 'size' : 1000}
name' : 'file1', 'size' : 1000}
{'filename' : 'file1', 'size' : 1000}

How can I prevent / correct this?
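
For context, a minimal sketch of what logger.py might look like (hypothetical; the actual script isn't shown):

#!/usr/bin/env python
# Hypothetical sketch of logger.py: print one reformatted line
# per file passed on the command line.
import os
import sys

for path in sys.argv[1:]:
    record = {'filename': os.path.basename(path), 'size': os.path.getsize(path)}
    print(record)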

asked Feb 16 '11 by Josh K


3 Answers

In general, it's very hard to guarantee this won't happen without delving into multi-process locking, but you can usually reduce the problem a lot.

The most common cause of this is I/O buffering, within Python or libc. For example, it may be buffering 16k of output, and then writing the whole block at once. You can reduce that by flushing stdout after writing to it, but that's awkward. In theory, you should be able to pass -u to Python to disable stdout buffering, but that didn't work when I tried it. See Sebastjan's answer to Disable output buffering for a more generic solution (though there's probably a way to disable output buffering more directly).
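
If flushing is acceptable, here is a minimal sketch of the flush-after-each-line approach (assuming logger.py emits one line per record; the emit name is illustrative):

import sys

def emit(record):
    # Build the whole line first, then write and flush in one go, so the
    # line leaves Python's buffer immediately instead of sitting in a
    # larger block that gets written out later.
    sys.stdout.write(str(record) + '\n')
    sys.stdout.flush()

emit({'filename': 'file1', 'size': 1000})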

A second problem is that the underlying writes aren't always atomic. In particular, writes to pipes are only guaranteed to be atomic up to a certain size (PIPE_BUF, at least 512 bytes per POSIX); above that it's not guaranteed. That only strictly applies to pipes (not files), but the same general principle holds: smaller writes are more likely to happen atomically. See http://www.opengroup.org/onlinepubs/000095399/functions/write.html.
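
A sketch of leaning on that guarantee: emit each record as a single small os.write() call (512 is the portable minimum for PIPE_BUF; os.fpathconf can report the real limit for a given descriptor):

import os
import sys

PIPE_BUF = 512  # portable lower bound; Linux actually uses 4096

def emit(record):
    data = (str(record) + '\n').encode('utf-8')
    assert len(data) <= PIPE_BUF, 'line too long for a guaranteed-atomic write'
    # One os.write() per line, below PIPE_BUF, so the kernel won't
    # interleave it with writes from sibling processes on the same pipe.
    os.write(sys.stdout.fileno(), data)

emit({'filename': 'file1', 'size': 1000})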

answered by Glenn Maynard


The complicated, and technically correct, solution would be to implement a mutex for writing, but I think that's suboptimal.

And it's not fun anyway. How about piping the output from xargs (that way you get solid chunks of output, instead of a stream of output that gets broken up) and then combining those chunks somehow?
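
One sketch of that idea, assuming you can modify logger.py: have every process append to its own chunk file, then concatenate afterwards.

import os
import sys

# Hypothetical variant of logger.py: each process appends to its own
# chunk file (named by PID here), so no two writers share a file and
# lines can't interleave.
out = open('collab.%d' % os.getpid(), 'a')
for path in sys.argv[1:]:
    record = {'filename': os.path.basename(path), 'size': os.path.getsize(path)}
    out.write(str(record) + '\n')
out.close()

Once xargs finishes, cat collab.* > collab stitches the chunks back together (note that with -n1 every invocation is a fresh process, so expect one chunk file per input file).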

answered by Oren Mazor


The problem is that the output from the parallel jobs run by xargs gets mixed together. GNU Parallel is made for solving that problem: by default, it guarantees that output from different jobs is not mixed together. So you can simply do this:

find folder/ | parallel ./logger.py > collab

This will run one logger.py per CPU. If you want 10:

find folder/ | parallel -P10 ./logger.py > collab

Watch the intro video to learn more about GNU Parallel: http://www.youtube.com/watch?v=OpaiGYxkSuQ

answered by Ole Tange