
Using GNU parallel to parallelize a for loop

I have seen several questions about this topic, but I lack the ability to translate them to my specific problem. I have a for loop that loops through subdirectories and then executes a .sh script on a compressed text file inside each directory. I want to parallelize this process, but I'm struggling to apply GNU parallel.

Here is my loop:

for d in ./*/ ; do (cd "$d" && script.sh); done

I understand I need to input a list into parallel, so I have been trying this:

ls -d */ | parallel cd && script.sh

While this appears to get started, I get an error when gzip tries to unzip one of the txt files inside the directory, saying the file does not exist:

gzip: *.txt.gz: No such file or directory

However, when I run the original for loop, I have no issues aside from it taking a century to finish. Also, I only get the gzip error once when using parallel, which is so weird considering I have over 1000 sub-directories.

My questions are:

  1. How do I get parallel to work in my case? How do I get parallel to parallelize the application of a .sh script to 1000s of files in their own sub-directories? I.e., what is the solution to my problem? I gotta make progress.

  2. What am I missing? Syntax, loop, bad script? I want to learn.

  3. Is parallel actually attempting to run all these .sh scripts in parallel? Why don't I get an error for every .txt.gz file?

  4. Is parallel the best option for the application? Is there another option that is better suited to my needs?

Asked by Phil_T, Aug 23 '17 03:08


1 Answer

Two problems:

  1. In:

    ls -d */ | parallel cd && script.sh
    

    what is parallelized is just cd, not script.sh. script.sh is only executed once, after all parallel cd jobs have finished, and only if none of them failed. It is the same as:

    ls -d */ | parallel cd
    if [ $? -eq 0 ]; then script.sh; fi
    
  2. The target directory never reaches script.sh. When the command contains no {} replacement string, parallel appends each input line to the end of the command, so every job runs cd on one directory in its own short-lived shell, and that has no effect on the shell you launched the pipeline from. The final script.sh is therefore executed in the current directory (from where you invoked the command), where there are probably no *.txt.gz files, hence the error.
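
The first point, and the fact that a cd run in a child process never moves the calling shell, can be reproduced without parallel at all. This is only a minimal sketch, assuming bash; cat stands in for parallel, and the variable names (once, start, child, after) are purely illustrative:

```shell
#!/usr/bin/env bash
# '&&' binds more loosely than '|': the right-hand side runs once,
# after the whole pipeline on the left has finished successfully.
once=$(printf '%s\n' a/ b/ c/ | cat >/dev/null && echo "ran once")
echo "$once"

# A cd performed in a child shell does not change the caller's directory.
start=$PWD
child=$(cd / && pwd)   # command substitution runs in a child shell
after=$PWD
echo "child shell was in: $child"
echo "caller before:      $start"
echo "caller after:       $after"
```

Three input lines go into the pipeline, yet the echo after && runs a single time, and the caller's directory is the same before and after the child's cd.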

You can check the effect of the first problem yourself with:

$ mkdir /tmp/foobar && cd /tmp/foobar && mkdir a b c
$ ls -d */ | parallel cd && pwd
/tmp/foobar

The output of pwd is printed only once, even if you have more than one input directory. You can fix it by quoting the command and then check the second problem with:

$ ls -d */ | parallel 'cd && pwd'
/homes/myself
/homes/myself
/homes/myself

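You should see as many pwd outputs as there are input directories, but the output is always the same: your home directory. Since the command contains no replacement string, parallel appends each input to its end (after pwd), so cd still receives no argument. You can fix the second problem by using the {} replacement string, which is substituted with the current input. Check it with: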

$ ls -d */ | parallel 'cd {} && pwd'
/tmp/foobar/a
/tmp/foobar/b
/tmp/foobar/c

Now, you should have all input directories properly listed in the output.

For your specific problem this should work:

ls -d */ | parallel 'cd {} && script.sh'
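
Before launching the real jobs over 1000+ directories, it may be worth previewing what parallel would run. The following is a sketch, assuming GNU parallel is installed and that script.sh is on your PATH as in the original loop. --dry-run prints each command instead of executing it, ::: passes the directories as arguments instead of parsing ls output, -k keeps the output in input order, and -j caps the number of simultaneous jobs (4 is an arbitrary choice). The scratch directory and the names a, b, c are only for illustration:

```shell
#!/usr/bin/env bash
# Scratch tree standing in for the real data directories (hypothetical names).
workdir=$(mktemp -d)
mkdir "$workdir/a" "$workdir/b" "$workdir/c"
cd "$workdir" || exit 1

# Preview only: --dry-run prints the commands, nothing is executed.
preview=$(parallel -k --dry-run 'cd {} && script.sh' ::: */)
echo "$preview"

# Real run, at most 4 jobs at a time (uncomment once the preview looks right):
# parallel -j 4 'cd {} && script.sh' ::: */

# Clean up the scratch tree.
cd / && rm -rf "$workdir"
```

Each printed line should contain one directory in place of {}; if a line shows cd without a directory, the replacement string is missing or misplaced.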

Answered by Renaud Pacalet, Oct 17 '22 09:10