
parallelize bash script

Tags:

bash

I need the sum of an integer contained in several webpages. getPages() parses the integer and stores it in $subTotal. getPages() is called in a for loop in the background, but how do I get the sum of $subTotal? Is this a subshelling problem?

This is what I've tried so far.

#!/bin/bash
total=0
getPages(){
  subTotal=$(lynx -dump http://"$(printf "%s:%s" $1 $2)"/file.html | awk -F, 'NR==1 {print $1}' | sed 's/\s//g')
  total=$(($total+$subTotal))
  echo "SubTotal: " $subTotal "Total: " $total
}
# /output/ SubTotal:  22 Total:  22
# /output/ SubTotal:  48 Total:  48   //Note Total should be 70

ARRAY=(
'pf2.server.com:6599'
'pf5.server.com:1199'
...
)

for server in ${ARRAY[@]} ; do
  KEY=${server%%:*}
  VALUE=${server##*:}
  getPages $KEY $VALUE &
done
wait
echo $total
exit 0

# /output/ 0

Any advice appreciated.

Eric Fortis asked Jul 04 '11 19:07


1 Answer

Yes, this is a subshelling problem. Everything executed in a ... & list (i.e. your getPages $KEY $VALUE &) is executed in a subshell, which means that changes of variables there do not affect the parent shell.
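The effect can be reproduced without any network access. The following is a minimal sketch, not the question's code; the variable names counter, after_background and after_foreground are made up for illustration.

```bash
#!/bin/bash
counter=0

# The trailing & runs this group in a subshell, which changes only its own copy.
{ counter=42; } &
wait
after_background=$counter
echo "after background job: $after_background"    # prints 0

# Without &, the same group runs in the current shell.
{ counter=42; }
after_foreground=$counter
echo "after foreground group: $after_foreground"  # prints 42
```

This is the same reason the question's script prints 0 at the end: every getPages call updates total in its own subshell.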

I think one could do something using coprocesses (i.e. communication by streams), or maybe using GNU parallel or pexec.
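For comparison, communication by streams can also be done in plain bash, without coprocesses or extra tools. The sketch below is an assumption-laden illustration, not a tested replacement for the question's script: fetch_subtotal is a hypothetical stand-in for the lynx/awk pipeline, the array name servers is made up, and the values 22 and 48 are simply the subtotals shown in the question's output comments.

```bash
#!/bin/bash
# Hypothetical stand-in for the real download: prints one integer per server.
fetch_subtotal() {
  case "$1" in
    pf2.server.com:6599) echo 22 ;;
    pf5.server.com:1199) echo 48 ;;
  esac
}

servers=('pf2.server.com:6599' 'pf5.server.com:1199')

total=0
# The loop runs in the current shell and reads whatever the jobs print.
while read -r subtotal; do
  total=$(( total + subtotal ))
done < <(
  # Each call runs in the background; all of them write to the same pipe.
  for server in "${servers[@]}"; do
    fetch_subtotal "$server" &
  done
  wait   # keep the pipe open until every background job has finished
)

echo "total: $total"
```

The order in which the subtotals arrive may vary between runs, but the sum does not depend on it. Each job here prints a single short line, so the lines should not get mixed up in the pipe.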


Here is an example with pexec, using standard output to pass the results back from the individual processes. I used a simpler command because the servers you listed are not accessible from here. It counts the lines of some webpages and sums them up.

ARRAY=(
   'www.gmx.de:80'
   'www.gmx.net:80'
   'www.gmx.at:80'
   'www.gmx.li:80'
)


(( total = 0 ))
while read subtotal
do
   (( total += subtotal ))
   echo "subtotal: $subtotal, total: $total"
done < <(
    pexec --normal-redirection --environment hostname --number ${#ARRAY[*]} \
     --parameters "${ARRAY[@]}" --shell-command -- '
     lynx -dump http://$hostname/index.html | wc -l'
    )

echo "total: $total"

We are using some tricks here:

  • We pipe the output of the parallel processes back to the main process and read it in a loop there.
  • To avoid creating a subshell for the while loop, we use bash's process substitution feature (<( ... )) together with input redirection (<) instead of a simple pipe.
  • We do arithmetic in a (( ... )) arithmetic expression command. I could have used let instead, but then I would have to quote everything or avoid spaces. (Your total=$(( total + subtotal )) would have worked, too.)
  • the options to pexec:
    • --normal-redirection means redirecting all the output streams from the subprocesses together into the output stream of pexec. (I'm not sure whether this could result in some confusion if two processes want to write at the same time.)
    • --environment hostname passes the differing parameter for each execution as an environment variable. Otherwise it would be a simple command line argument.
    • --number ${#ARRAY[*]} (which becomes --number 4 in our case) makes sure that all the processes are started in parallel, instead of only as many as we have CPUs or some other heuristic. (This is for network-roundtrip-bound work. For CPU-bound or bandwidth-bound work, a smaller number would be better.)
    • --shell-command makes sure the command is evaluated by a shell instead of being executed directly. This is necessary because of the pipeline in there.
    • --parameters "${ARRAY[@]}" lists the actual arguments, i.e. the elements of the array. For each of them a separate instance of the command will be started.
    • after the final -- comes the command, as a single '-quoted string, to avoid premature interpretation of the $hostname in there by the outer shell. The command simply downloads the file and pipes it to wc -l, counting the lines.
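The difference between a simple pipe and process substitution can be seen in isolation. This is a small sketch assuming bash's default settings (in particular, the lastpipe option is off); the variable names piped_total and procsub_total and the input values are made up for illustration.

```bash
#!/bin/bash
# With a simple pipe, the while loop runs in a subshell, so its sum is lost.
piped_total=0
printf '%s\n' 22 48 | while read -r n; do
  (( piped_total += n ))
done
echo "pipe: $piped_total"

# With process substitution, the loop runs in the current shell.
procsub_total=0
while read -r n; do
  (( procsub_total += n ))
done < <(printf '%s\n' 22 48)
echo "process substitution: $procsub_total"
```

With default settings the first line reports 0 and the second reports 70, which is why the pexec example feeds the loop through < <( ... ).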

Example output:

subtotal: 1120, total: 1120
subtotal: 968, total: 2088
subtotal: 1120, total: 3208
subtotal: 1120, total: 4328
total: 4328

Here is (part of) the output of ps -f while this is running:

 2799 pts/1    Ss     0:03  \_ bash
 5427 pts/1    S+     0:00      \_ /bin/bash ./download-test.sh
 5428 pts/1    S+     0:00          \_ /bin/bash ./download-test.sh
 5429 pts/1    S+     0:00              \_ pexec --number 4 --normal-redirection --environment hostname --parame...
 5430 pts/1    S+     0:00                  \_ /bin/sh -c ?     lynx -dump http://$hostname/index.html | wc -l
 5434 pts/1    S+     0:00                  |   \_ lynx -dump http://www.gmx.de:80/index.html
 5435 pts/1    S+     0:00                  |   \_ wc -l
 5431 pts/1    S+     0:00                  \_ /bin/sh -c ?     lynx -dump http://$hostname/index.html | wc -l
 5436 pts/1    S+     0:00                  |   \_ lynx -dump http://www.gmx.net:80/index.html
 5437 pts/1    S+     0:00                  |   \_ wc -l
 5432 pts/1    S+     0:00                  \_ /bin/sh -c ?     lynx -dump http://$hostname/index.html | wc -l
 5438 pts/1    S+     0:00                  |   \_ lynx -dump http://www.gmx.at:80/index.html
 5439 pts/1    S+     0:00                  |   \_ wc -l
 5433 pts/1    S+     0:00                  \_ /bin/sh -c ?     lynx -dump http://$hostname/index.html | wc -l
 5440 pts/1    S+     0:00                      \_ lynx -dump http://www.gmx.li:80/index.html
 5441 pts/1    S+     0:00                      \_ wc -l

We can see that everything really does run in parallel, as far as that is possible on my one-processor system.

Paŭlo Ebermann answered Oct 19 '22 02:10