
xargs: losing output when redirecting stdout to a file in parallel mode

Tags: bash, shell, xargs

I am using GNU xargs (version 4.2.2) in parallel mode and I seem to be reliably losing output when redirecting to a file. When redirecting to a pipe, it appears to work correctly.

The following shell commands demonstrate a minimal, complete, and verifiable example of the issue. I generate 2550 numbers and use xargs to split them into lines of 100 args each, for a total of 26 lines, where the 26th line contains only 50 args.

# generate numbers 1 to 2550 where each number is on its own line
$ seq 1 2550 > /tmp/nums
$ wc -l /tmp/nums
2550 /tmp/nums

# piping to wc is accurate: 26 lines, 2550 args
$ xargs -P20 -n 100 </tmp/nums | wc
     26    2550   11643

# redirecting to a file is clearly inaccurate: 22 lines, 2150 args
$ xargs -P20 -n 100 </tmp/nums >/tmp/out; wc /tmp/out
     22  2150 10043 /tmp/out

I believe the problem is not related to the underlying shell, since the shell performs the redirection before the command executes and then waits for xargs to complete. My hypothesis is that xargs exits before flushing its buffer. However, if that hypothesis is correct, I do not know why the problem doesn't manifest when writing to a pipe.
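To pin down exactly which numbers go missing, a comparison along these lines may help. This is only a sketch: the function name missing_numbers and its temporary files are hypothetical, and it assumes the expected numbers are stored one per line while the xargs output separates them with spaces and newlines.

```shell
#!/bin/sh
# Hypothetical helper: print every number that appears in the expected
# list ($1, one number per line) but not in the xargs output file ($2).
missing_numbers() {
    expected_sorted=$(mktemp)
    actual_sorted=$(mktemp)
    # comm needs both inputs sorted the same way; LC_ALL=C keeps that simple.
    LC_ALL=C sort "$1" > "$expected_sorted"
    # Put one number per line (squeezing empty lines), then sort.
    tr -s ' ' '\n' < "$2" | LC_ALL=C sort > "$actual_sorted"
    # -23 hides lines found only in the output and lines common to both,
    # leaving the lines found only in the expected list.
    LC_ALL=C comm -23 "$expected_sorted" "$actual_sorted"
    rm -f "$expected_sorted" "$actual_sorted"
}

# Example call against the files used above:
#   missing_numbers /tmp/nums /tmp/out
```

The result is in byte-wise sort order rather than numeric order, and an empty result means every input number made it into the output file.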

Edit:

When I use >> (create or append to the file) in the shell, the problem does not seem to manifest:

# appending to file
$ >/tmp/out
$ xargs -P20 -n 100 </tmp/nums >>/tmp/out; wc /tmp/out
     26    2550   11643

# creating and appending to file
$ rm /tmp/out
$ xargs -P20 -n 100 </tmp/nums >>/tmp/out; wc /tmp/out
     26    2550   11643
asked Nov 09 '22 05:11 by snap


1 Answer

Your problem is caused by the output from the different parallel processes being mixed. The following commands demonstrate it:

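# create six large test files a..f (perl writes one very long line to each)
parallel perl -e '\$a=\"1{}\"x10000000\;print\ \$a,\"\\n\"' '>' {} ::: a b c d e f
ls -l a b c d e f
# GNU Parallel: output is buffered per job, and -k keeps it in input order
parallel -kP4 -n1 grep 1 > out.par ::: a b c d e f
# xargs in parallel, grep using its default output buffering
echo a b c d e f | xargs -P4 -n1 grep 1 > out.xargs-unbuf
# xargs in parallel, grep flushing after every line
echo a b c d e f | xargs -P4 -n1 grep --line-buffered 1 > out.xargs-linebuf
# xargs running the jobs one at a time, as a reference
echo a b c d e f | xargs -n1 grep 1 > out.xargs-serial
# compare sizes and checksums of the four result files
ls -l out*
md5sum out*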

The solution is to buffer the output from each job, either in memory or in temporary files (as GNU Parallel does).
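As a rough sketch of the tmpfile approach with plain xargs (an illustration, not how GNU Parallel itself is implemented): each job writes to its own temporary file, and the files are concatenated once all jobs have finished. The scratch directory layout and the part.XXXXXX template are assumptions made for this example.

```shell
#!/bin/sh
set -eu

workdir=$(mktemp -d)               # scratch directory (illustrative)
trap 'rm -rf "$workdir"' EXIT      # clean up when the script exits
mkdir "$workdir/parts"

seq 1 2550 > "$workdir/nums"

# Inside the sh -c script, $0 is the parts directory and "$@" holds the
# (up to 100) arguments xargs hands to this job. Every job creates its
# own file with mktemp, so no two jobs ever write to the same file.
xargs -P20 -n 100 sh -c '
    part=$(mktemp "$0/part.XXXXXX")
    echo "$@" > "$part"
' "$workdir/parts" < "$workdir/nums"

# xargs has returned, so every job is done; merge the per-job files.
cat "$workdir"/parts/part.* > "$workdir/out"
wc "$workdir/out"
```

With the 2550 input numbers, the merged file should contain 26 lines and 2550 words. The order of the lines follows the random temporary file names, so it is arbitrary, much like the order of jobs finishing under -P.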

answered Nov 15 '22 05:11 by Ole Tange