I'm trying to use GNU parallel to post a lot of files to a web server. In my directory, I have some files:
file1.xml
file2.xml
and I have a shell script that looks like this:
#! /usr/bin/env bash
CMD="curl -X POST -d@$1 http://server/path"
eval $CMD
There's some other stuff in the script, but this was the simplest example. I tried to execute the following command:
ls | parallel -j2 script.sh {}
Which is what the GNU parallel pages show as the "normal" way to operate on files in a directory. This seems to pass the name of the file into my script, but curl complains that it can't load the data file passed in. However, if I do:
find . -name '*.xml' | parallel -j2 script.sh {}
it works fine. Is there a difference between how ls and find are passing arguments to my script? Or do I need to do something additional in that script?
GNU parallel is a variant of xargs. They both have very similar interfaces, and if you're looking for help on parallel, you may have more luck looking up information about xargs.
That being said, the way they both operate is fairly simple. Both programs read input from STDIN, break it up into tokens, and pass each token to a provided program as an argument. By default, xargs splits the input on whitespace and passes as many tokens as possible to a single invocation of the program, starting a new process only when the argument limit is hit. parallel, by default, runs the program once per input line.
Here is an example:
> echo "foo bar \
baz" | xargs echo
foo bar baz
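To make the batching behavior easier to see, here is a small sketch that runs the same input through xargs twice: once with the defaults and once with -n 1, which limits each invocation to a single token. The variable names are only for illustration.

```shell
# Default: every whitespace-separated token goes to ONE echo invocation.
batched=$(printf 'foo bar\nbaz\n' | xargs echo)
echo "$batched"

# -n 1: xargs starts a separate echo for each token, one token per line.
one_per_run=$(printf 'foo bar\nbaz\n' | xargs -n 1 echo)
echo "$one_per_run"
```

The second form is closer to what parallel does by default, except that parallel splits on lines rather than on every whitespace character.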
There are some problems with the default behavior, so it is common to see several variations.
The first issue is that because whitespace is used to tokenize, any file names with whitespace in them will cause xargs to break. One solution is to tokenize around the NULL character instead. find even provides an option to make this easy to do:
> echo "Success!" > bad\ filename
> find . -name "bad filename" -print0 | xargs -0 cat
Success!
The -print0 option tells find to separate file names with the NULL character instead of a newline.
The -0 option tells xargs to use the NULL character to tokenize each argument.
Note that parallel is a little better than xargs in that its default behavior is to tokenize around only newlines, so there is less of a need to change the default behavior.
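As a rough sketch of the difference, the following runs in a throwaway directory (created with mktemp -d, which is my own choice for the demo) and compares how a file name containing a space is tokenized with and without the NULL delimiter.

```shell
# Work in a scratch directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
echo "Success!" > "bad filename"

# Default tokenizing: xargs splits "./bad filename" into two arguments.
split_count=$(find . -type f | xargs -n 1 echo | wc -l | tr -d ' ')

# NULL-delimited: the whole name arrives as a single argument, so cat works.
content=$(find . -type f -print0 | xargs -0 cat)

echo "tokens without -0: $split_count"
echo "cat with -0: $content"
```

Without -print0 and -0, cat would have been asked to open ./bad and filename, neither of which exists.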
Another common issue is that you may want to control how the arguments are passed to the program run by xargs or parallel. If you need a specific placement of the arguments, you can use {} to specify where each argument is to be placed. parallel understands {} by default; xargs needs the -I {} option to enable it.
> mkdir new_dir
> find . -name '*.xml' | xargs -I {} mv {} new_dir
This will move all .xml files in the current directory and its subdirectories into the new_dir directory. With -I, xargs runs the command once per input line, so it actually breaks down into the following:
> find . -name '*.xml' | xargs -I {} echo mv {} new_dir
mv ./foo.xml new_dir
mv ./bar.xml new_dir
mv ./baz.xml new_dir
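If you want to try the placeholder form safely, here is a minimal sketch in a scratch directory. The file names and the use of mktemp -d are assumptions for the demo, and I added -maxdepth 1 so that find does not descend into new_dir while files are being moved there.

```shell
# Scratch directory with two XML files and one unrelated file.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch foo.xml bar.xml notes.txt
mkdir new_dir

# -I {} runs one mv per input line, replacing {} with that line.
find . -maxdepth 1 -name '*.xml' | xargs -I {} mv {} new_dir

# Count what ended up in new_dir; notes.txt should still be in place.
moved=$(ls new_dir | wc -l | tr -d ' ')
echo "moved $moved files"
```

Replacing mv with echo mv is a cheap dry run: it prints the commands that would be executed instead of running them.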
So taking into consideration how xargs and parallel work, you should hopefully be able to see the issue with your command. find . -name '*.xml' will generate a list of xml files to be passed to the script.sh program.
> find . -name '*.xml' | parallel -j2 echo script.sh {}
script.sh ./foo.xml
script.sh ./bar.xml
script.sh ./baz.xml
However, ls | parallel -j2 script.sh {} will generate a list of ALL files in the current directory to be passed to the script.sh program.
> ls | parallel -j2 echo script.sh {}
script.sh some_directory
script.sh some_file
script.sh foo.xml
...
A more correct variant on the ls version would be as follows:
> ls *.xml | parallel -j2 script.sh {}
However, an important difference between this and the find version is that find will search through all subdirectories for files, while ls will only search the current directory. The equivalent find version of the above ls command would be as follows:
> find . -maxdepth 1 -name '*.xml' | parallel -j2 script.sh {}
This will only search the current directory.
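To compare what each listing command actually feeds into the pipeline, here is a small sketch that leaves parallel out and just captures the lists. The directory layout is made up for the demo and mirrors the names used above.

```shell
# Scratch directory: one XML file at the top level, one nested, plus extras.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir some_directory
touch some_file foo.xml some_directory/nested.xml

all_entries=$(ls | wc -l | tr -d ' ')              # every entry, XML or not
glob_only=$(ls *.xml)                              # top-level XML files only
find_recursive=$(find . -name '*.xml' | sort)      # XML files at any depth
find_shallow=$(find . -maxdepth 1 -name '*.xml')   # top-level XML files only

echo "ls:                 $all_entries entries"
echo "ls *.xml:           $glob_only"
echo "find (recursive):   $find_recursive"
echo "find (-maxdepth 1): $find_shallow"
```

Note that find prefixes each name with ./ while ls does not, which is harmless here but worth knowing if your script does anything with the path.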