Given a directory with a few million files in it we want to extract some data from those files.
find /dir/ -type f | awk -F"|" '$2 ~ /string/{ print $3"|"$7 }' > the_good_stuff.txt
That will never scale so we introduce xargs.
find /dir/ -type f -print0 | xargs -0 -n1 -P6 awk -F"|" '$2 ~ /string/{ print $3"|"$7 }'
This produces valid output no matter how long we run it. Sweet, so let's write it to a file by appending > the_good_stuff_from_xargs.txt to that command. Except now the file contains mangled lines.
What strikes me is that while I watch the output of the six subprocesses that xargs opens as stdout in my terminal, the data look fine. Corruption appears only the moment the data is redirected onto the filesystem.
I've tried appending the following to the command:
> myfile.txt
>> myfile.txt
| mawk '{print $0}' > myfile.txt
And various other concepts of redirecting or otherwise "pooling" the output of the xargs before writing it to disk, with the data being corrupted in every version.
I'm positive the raw files are not malformed. I'm positive that, viewed as stdout in the terminal, the command with xargs produces valid output; I've watched it spit text for up to 10 minutes straight...
Local disk is an SSD... I'm reading and writing from the same file system.
Why does redirecting the output of find /dir/ -type f -print0 | xargs -0 -n1 -P6 awk -F"|" '$2 ~ /string/{ print $3"|"$7 }' cause the data to become malformed?
EDIT
I can't currently install unbuffer, but stdbuf -oL -eL modifies the command's output to be line buffered and so, theoretically, should do the same thing.
I've tried both stdbuf xargs cmd and xargs stdbuf cmd; both resulted in exceedingly broken lines.
The -P6 is required in order for this command to complete in any reasonable amount of time.
EDIT 2
To clarify: xargs and its -P6 flag are requirements for solving the problem, because the directory we are working in has millions of files that must be scanned.
Obviously we could remove -P6 or in some other fashion stop running multiple jobs at once, but that doesn't answer the question of why the output is getting mangled, nor is it a realistic approach to restoring the output to a "correct" state while still accomplishing the task at scale.
Solution
The accepted answer mentioned using parallel which worked the best out of all the answers.
The final command I ran looked like this:
time find -L /dir/ -type f -mtime -30 -print0 | parallel -0 -X awk -f manual.awk > the_good_stuff.txt
Awk was being difficult, so I moved the -F"|" into the script itself. By default parallel spins up one job per core on the box; you can use -j to set the number of jobs lower if need be.
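The contents of manual.awk aren't shown, so this is a hypothetical reconstruction: moving -F"|" into the script presumably means setting FS in a BEGIN block, with the rest carried over from the original one-liner.

```shell
# Hypothetical reconstruction of manual.awk -- the real file isn't shown,
# but moving -F"|" into the script means setting FS in a BEGIN block.
cat > manual.awk <<'EOF'
BEGIN { FS = "|" }
$2 ~ /string/ { print $3 "|" $7 }
EOF

# quick sanity check on one pipe-delimited line
printf 'a|string-hit|c|d|e|f|g\n' | awk -f manual.awk
# → c|g
```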
In really scientific terms this was a massive speed increase. What took an unmeasured number of hours (likely 6+) was 10% complete after six minutes, so it will likely finish within an hour.
One catch is that you have to make sure the command running in parallel isn't attempting to write to a file itself... that effectively bypasses the output processing that parallel performs on the jobs it runs!
Lastly, without -X parallel acts similarly to xargs -n1.
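For what it's worth, there is also a way to keep xargs -P6 while sidestepping the shared stdout entirely: give each batch its own output file, then concatenate with a single writer. A sketch; /dir/ and the awk program are from the question, while the batch size, temp-file naming, and directory layout are assumptions:

```shell
#!/bin/sh
# Per-batch output files: the six awk processes never share a file
# descriptor, so their lines cannot interleave. The -n100 batch size
# and part.XXXXXX naming are illustrative choices.
tmpdir=$(mktemp -d)

find /dir/ -type f -print0 |
  xargs -0 -n100 -P6 sh -c '
    # $0 is the temp dir; each batch of up to 100 files gets its own part file
    awk -F"|" "\$2 ~ /string/ { print \$3 \"|\" \$7 }" "$@" \
      > "$(mktemp "$0/part.XXXXXX")"
  ' "$tmpdir"

# a single cat is the only writer to the final file, so no mangling
cat "$tmpdir"/part.* > the_good_stuff.txt
rm -r "$tmpdir"
```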
man xargs mentions this problem: "Please note that it is up to the called processes to properly manage parallel access to shared resources. For example, if more than one of them tries to print to stdout, the output will be produced in an indeterminate order (and very likely mixed up)"
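The mechanism behind this is stdio buffering: when stdout is a terminal it is line buffered, but when redirected to a file or pipe it is fully buffered (typically around 4 KiB), so a long line goes out as several write() calls whose chunks can interleave with another job's chunks. A small demonstration under assumed sizes (the 5000-character lines and two jobs are arbitrary): the byte count is always exact, but line integrity is not guaranteed.

```shell
# Two parallel jobs each print 200 lines of 5000 repeated characters.
# A 5001-byte line exceeds the stdio buffer, so each line is flushed in
# several write() calls; with -P2 those chunks may interleave mid-line.
printf 'x\ny\n' | xargs -n1 -P2 sh -c '
  awk -v c="$0" "BEGIN {
    for (i = 0; i < 200; i++) {
      s = sprintf(\"%5000s\", \"\"); gsub(/ /, c, s); print s
    }
  }"
' > mixed.txt

# the byte count is always exact (2 jobs * 200 lines * 5001 bytes)...
wc -c < mixed.txt
# → 2000400
# ...but the count of intact lines may come up short of 400:
grep -c -x -e 'x\{5000\}' -e 'y\{5000\}' mixed.txt
```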
Luckily, there is a way to make this operation an order of magnitude faster and solve the mangling problem at the same time:
find /dir/ -type f -print0 | xargs -0 awk -F"|" '$2 ~ /string/{ print $3"|"$7 }'
Why?
-P6 is shuffling your output, so don't use it. xargs -n1 launches one awk process for each file, whereas without -n1, xargs launches far fewer awk processes, like this:
files | xargs -n1 awk
=>
awk file1
awk file2
...
awk fileN
vs
files | xargs awk
=>
awk file1 file2 ... fileN # or broken into a few awk commands if many files
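The batching behaviour above is easy to see with echo standing in for awk (a toy example with three made-up arguments):

```shell
# with -n1: one invocation per argument, so three separate echo runs
printf 'a\nb\nc\n' | xargs -n1 echo
# → a
#   b
#   c

# without -n1: xargs packs as many arguments per invocation as fit
printf 'a\nb\nc\n' | xargs echo
# → a b c
```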
I ran your code on ~20k text files, each ~20k in size, with and without -n1 -P6:
with -n1 -P6 23.138s
without 3.356s
If you want parallelism without xargs's stdout shuffling, use GNU parallel (also suggested by Gordon Davisson), e.g.:
find /dir/ -type f -print0 | parallel --xargs -0 -q awk -F"|" '$2 ~ /string/{ print $3"|"$7 }'
Note: -q is necessary to quote the command string; otherwise the quotes in -F"|" and around the awk code are stripped by the time parallel runs the command.
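What goes wrong without that quoting can be reproduced with a plain shell (illustrative strings, not the real command): the calling shell has already removed the quotes, so when the words are re-evaluated by another shell, the bare | starts a pipeline.

```shell
# the quotes around | were removed by the first shell, so a second
# round of shell evaluation treats | as a pipe:
sh -c 'echo left -F| cat'
# → left -F        (echo's output went through cat; the | was not literal)

# re-quoted, the | survives as a literal argument (this is what -q restores)
sh -c 'echo left "-F|"'
# → left -F|
```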
Using parallel saves a bit of time, but not as much as ditching -n1 did:
parallel 1.704s
P.S. Introducing a cat (as Matt does in his answer) is a tiny bit faster than plain xargs awk:
xargs awk 3.356s
xargs cat | awk 3.036s