"find" and "ls" with GNU parallel

Tags:

linux

find

bash

parallel-processing

gnu-parallel

I'm trying to use GNU parallel to post a lot of files to a web server. In my directory, I have some files:

file1.xml
file2.xml

and I have a shell script that looks like this:

#! /usr/bin/env bash

CMD="curl -X POST -d@$1 http://server/path"

eval $CMD

There's some other stuff in the script, but this was the simplest example. I tried to execute the following command:

ls | parallel -j2 script.sh {}

Which is what the GNU parallel pages show as the "normal" way to operate on files in a directory. This seems to pass the name of the file into my script, but curl complains that it can't load the data file passed in. However, if I do:

find . -name '*.xml' | parallel -j2 script.sh {}

it works fine. Is there a difference between how ls and find are passing arguments to my script? Or do I need to do something additional in that script?

324

asked Sep 30 '11 12:09

Dave

1 Answer

GNU parallel is a variant of xargs. They both have very similar interfaces, and if you're looking for help on parallel, you may have more luck looking up information about xargs.

That being said, the way they both operate is fairly simple. With their default behavior, both programs read input from STDIN and break it up into tokens (xargs splits on any whitespace; parallel splits only on newlines). Each of these tokens is then passed to a provided program as an argument. The default for xargs is to pass as many tokens as possible to a single invocation of the program, and to start a new process only when the argument limit is hit. The default for parallel is to run one invocation of the program per input line.

Here is an example:

> echo "foo    bar \
  baz" | xargs echo
foo bar baz
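To make the batching behavior concrete, here is a minimal sketch that feeds the same input to xargs twice: once with the defaults, and once with -n 1 to cap each invocation at a single token. The variable names are only illustrative.

```shell
#!/usr/bin/env bash
# Three whitespace-separated tokens arrive on stdin, spread over two lines.
input='foo   bar
baz'

# Default: xargs packs every token into a single echo invocation.
batched=$(printf '%s\n' "$input" | xargs echo)

# -n 1 caps each invocation at one token, so echo runs three times.
one_per_call=$(printf '%s\n' "$input" | xargs -n 1 echo)

printf 'batched:\n%s\n\n' "$batched"
printf 'one token per call:\n%s\n' "$one_per_call"
```

The first result is a single line because echo ran once with three arguments; the second has one line per token because echo ran once per token.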

There are some problems with the default behavior, so it is common to see several variations.

The first issue is that because xargs tokenizes on whitespace, any file name containing whitespace will break it (and a file name containing a newline will break parallel too). One solution is to tokenize around the NUL character instead. find even provides an option to make this easy to do:

> echo "Success!" > bad\ filename
> find . -name "bad filename" -print0 | xargs -0 cat
Success!

The -print0 option tells find to separate file names with the NUL character instead of a newline.
The -0 option tells xargs to split its input on the NUL character instead of whitespace.
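As a rough illustration of the difference, the sketch below creates a file with a space in its name inside a throwaway directory (the mktemp location and variable names are assumptions for the example) and runs it through both tokenizing modes.

```shell
#!/usr/bin/env bash
# Scratch directory holding one file whose name contains a space.
workdir=$(mktemp -d)
printf 'Success!\n' > "$workdir/bad filename"

# Whitespace tokenizing: xargs splits the single path into two bogus arguments.
naive_count=$(find "$workdir" -type f | xargs -n 1 echo | wc -l | tr -d ' ')

# NUL tokenizing: the path survives intact, so cat can open the file.
safe_output=$(find "$workdir" -type f -print0 | xargs -0 cat)

echo "arguments seen without -print0/-0: $naive_count"
echo "output with -print0/-0: $safe_output"

rm -rf "$workdir"
```

This assumes the temporary directory path itself contains no whitespace; if it did, the naive count would be even higher.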

Note that parallel is a little better than xargs here: because its default behavior is to tokenize only around newlines, there is less of a need to change the default behavior.

Another common issue is that you may want to control how the arguments are passed to the program that xargs or parallel runs. If you need a specific placement of the arguments, you can use {} to mark where each argument is to be placed. parallel substitutes {} by default; xargs only does so when given -I {}.

> mkdir new_dir
> find . -name '*.xml' | xargs -I {} mv {} new_dir

This will move all XML files in the current directory and its subdirectories into the new_dir directory. It actually breaks down into the following:

> find . -name '*.xml' | xargs -I {} echo mv {} new_dir
mv ./foo.xml new_dir
mv ./bar.xml new_dir
mv ./baz.xml new_dir
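The placeholder form can be tried safely in a scratch directory. The following sketch (directory layout and file names are made up for the example) combines -I {} with NUL-delimited input so that each file gets its own mv call.

```shell
#!/usr/bin/env bash
# Scratch layout: two XML files and one unrelated file.
workdir=$(mktemp -d)
mkdir "$workdir/src" "$workdir/new_dir"
touch "$workdir/src/foo.xml" "$workdir/src/bar.xml" "$workdir/src/notes.txt"

# -I {} runs one mv per input item and substitutes the item wherever {} appears.
find "$workdir/src" -name '*.xml' -print0 |
  xargs -0 -I {} mv {} "$workdir/new_dir"

moved_count=$(find "$workdir/new_dir" -name '*.xml' | wc -l | tr -d ' ')
left_behind=$(ls "$workdir/src")

echo "moved: $moved_count, left behind: $left_behind"

rm -rf "$workdir"
```

Only the files matched by -name are moved; notes.txt stays where it was.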

So taking into consideration how xargs and parallel work, you should hopefully be able to see the issue with your command. find . -name '*.xml' will generate a list of only the XML files to be passed to the script.sh program.

> find . -name '*.xml' | parallel -j2 echo script.sh {}
script.sh ./foo.xml
script.sh ./bar.xml
script.sh ./baz.xml

However, ls | parallel -j2 script.sh {} will generate a list of ALL files in the current directory to be passed to the script.sh program.

> ls | parallel -j2 echo script.sh {}
script.sh some_directory
script.sh some_file
script.sh foo.xml
...

A more correct variant on the ls version would be as follows:

> ls *.xml | parallel -j2 script.sh {}

However, an important difference between this and the find version is that find will search through all subdirectories for files, while ls will only list the current directory. The equivalent find version of the above ls command would be as follows:

> find -maxdepth 1 -name '*.xml'

This will only search the current directory.
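To see how the three listings differ, here is a small sketch that counts what each one would hand to the script. The scratch directory and its contents are hypothetical, and nothing is actually posted anywhere.

```shell
#!/usr/bin/env bash
# Scratch layout: one XML file, one other file, and a nested XML file.
workdir=$(mktemp -d)
mkdir "$workdir/some_directory"
touch "$workdir/foo.xml" "$workdir/some_file" "$workdir/some_directory/nested.xml"

# Plain ls: every entry in the directory, XML or not.
ls_count=$(ls "$workdir" | wc -l | tr -d ' ')

# find with -name only: XML files at any depth.
find_all_count=$(find "$workdir" -name '*.xml' | wc -l | tr -d ' ')

# find with -maxdepth 1: XML files in the top directory only.
find_top_count=$(find "$workdir" -maxdepth 1 -name '*.xml' | wc -l | tr -d ' ')

echo "ls: $ls_count  find: $find_all_count  find -maxdepth 1: $find_top_count"

rm -rf "$workdir"
```

Piping the last find command into parallel -j2 script.sh {} would therefore run the script only on the top-level XML files.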

121

answered Oct 03 '22 18:10

Swiss
