How to parallelize for-loop in bash limiting number of processes

I have a bash script similar to:

NUM_PROCS=$1
NUM_ITERS=$2

for ((i=0; i<$NUM_ITERS; i++)); do
    python foo.py $i arg2 &
done

What's the most straightforward way to limit the number of parallel processes to NUM_PROCS? I'm looking for a solution that doesn't require packages/installations/modules (like GNU Parallel) if possible.

When I tried Charles Duffy's latest approach, running it under bash -x produced the following trace:

+ python run.py args 1
+ python run.py ... 3
+ python run.py ... 4
+ python run.py ... 2
+ read -r line
+ python run.py ... 1
+ read -r line
+ python run.py ... 4
+ read -r line
+ python run.py ... 2
+ read -r line
+ python run.py ... 3
+ read -r line
+ python run.py ... 0
+ read -r line

... continuing with other numbers between 0 and 5, until too many processes had been started for the system to handle and the script was killed.

asked Aug 04 '16 by strathallan

People also ask

Can you parallelize a bash script?

Often, yes: you can run parts of a Bash script in parallel, which can dramatically speed up the result.

How do you loop through a range of numbers in bash?

You can iterate over a sequence of numbers in bash in two ways. One is by using the seq command, and the other is by specifying the range in the for loop itself. By default, seq starts the sequence at one, increments by one at each step, and prints each number on its own line up to the upper limit.
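
For example, both of these print the numbers 1 through 5, one per line:

for i in $(seq 1 5); do echo "$i"; done   # using the seq command
for i in {1..5}; do echo "$i"; done       # specifying the range in the loop itself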

Is Xargs parallel?

xargs will run the first two commands in parallel, and then whenever one of them terminates, it will start another one, until the entire job is done. The same idea can be generalized to as many processors as you have handy. It also generalizes to other resources besides processors.
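
A small illustration of that behavior, using sleep as a stand-in for real work (the numbers are arbitrary):

# Run at most 2 of the 4 sleeps at once; as each one exits, xargs starts the next.
printf '%s\n' 3 1 2 1 | xargs -n 1 -P 2 sleep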




5 Answers

Here's a relatively simple way to accomplish this with only two additional lines of code; the explanation is inline.

NUM_PROCS=$1
NUM_ITERS=$2

for ((i=0; i<NUM_ITERS; i++)); do
    python foo.py "$i" arg2 &
    let 'i>=NUM_PROCS-1' && wait -n # once $NUM_PROCS workers are running, wait for one to exit before spawning the next
done
wait # wait for all remaining workers
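
A quick way to sanity-check the limiter (a hypothetical test, not part of the answer) is to substitute sleep for the real command; with NUM_PROCS=2 you should never see more than two sleep processes running at once:

NUM_PROCS=2
NUM_ITERS=6
for ((i=0; i<NUM_ITERS; i++)); do
    sleep 2 &                       # stand-in for python foo.py "$i" arg2
    let 'i>=NUM_PROCS-1' && wait -n
done
wait
echo "all $NUM_ITERS iterations finished"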
answered Oct 13 '22 by rtx13


This isn't the simplest solution, but if your version of bash doesn't have "wait -n" and you don't want to use other programs such as parallel or awk, here is a solution using while and for loops.

num_iters=10
total_threads=4
iter=1
while [[ "$iter" -le "$num_iters" ]]; do
    # run a batch of $total_threads jobs, or fewer if fewer iterations remain
    iters_remainder=$(( num_iters - iter + 1 ))
    if [[ "$iters_remainder" -lt "$total_threads" ]]; then
        threads=$iters_remainder
    else
        threads=$total_threads
    fi
    for ((t=1; t<=threads; t++)); do
        (
            # do stuff
        ) &
        ((++iter))
    done
    wait # block until the whole batch has finished
done
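
One caveat: a bare wait blocks until every job in the batch has exited, so a single slow job stalls the whole batch. The wait -n based answers here refill the pool as soon as any one job finishes.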
answered Oct 04 '22 by Jon


bash 4.4 introduced an interesting new type of parameter expansion that simplifies Charles Duffy's answer.

#!/bin/bash

num_procs=$1
num_iters=$2
num_jobs="\j"  # The prompt escape for number of jobs currently running
for ((i=0; i<num_iters; i++)); do
  while (( ${num_jobs@P} >= num_procs )); do
    wait -n
  done
  python foo.py "$i" arg2 &
done
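
If it helps to see the expansion in isolation, here is a minimal sanity check (my own illustration, assuming bash 4.4 or newer):

num_jobs="\j"            # \j is the prompt escape for the number of jobs managed by the shell
sleep 10 & sleep 10 &    # start two background jobs
echo "${num_jobs@P}"     # ${parameter@P} expands the value as a prompt string: prints 2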
answered Oct 13 '22 by chepner


GNU, macOS/OSX, FreeBSD and NetBSD can all do this with xargs -P; no particular bash version or package install is required. Here's 4 processes at a time:

printf "%s\0" {1..10} | xargs -0 -I @ -P 4 python foo.py @ arg2
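
Adapted to the question's variables (my adaptation, assuming GNU or BSD xargs; seq prints one number per line, and -I makes xargs consume one line per invocation):

seq 0 $(( NUM_ITERS - 1 )) | xargs -I @ -P "$NUM_PROCS" python foo.py @ arg2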
answered Oct 13 '22 by that other guy


As a very simple implementation, depending on a version of bash new enough to have wait -n (to wait until only the next job exits, as opposed to waiting for all jobs):

#!/bin/bash
#      ^^^^ - NOT /bin/sh!

num_procs=$1
num_iters=$2

declare -A pids=( )

for ((i=0; i<num_iters; i++)); do
  while (( ${#pids[@]} >= num_procs )); do
    wait -n
    for pid in "${!pids[@]}"; do
      kill -0 "$pid" &>/dev/null || unset "pids[$pid]"
    done
  done
  python foo.py "$i" arg2 & pids["$!"]=1
done

If running on a shell without wait -n, one can (very inefficiently) replace it with a command such as sleep 0.2, to poll every 1/5th of a second.
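
For illustration, the polling fallback might look like this (my sketch, not part of the original answer); the reaping loop below the sleep then prunes finished PIDs on every pass:

  while (( ${#pids[@]} >= num_procs )); do
    sleep 0.2   # no wait -n available: poll instead of blocking on a job exit
    for pid in "${!pids[@]}"; do
      kill -0 "$pid" &>/dev/null || unset "pids[$pid]"
    done
  done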


Since you're actually reading input from a file, another approach is to start N subprocesses, each of which processes only the lines where (linenum % N == threadnum):

num_procs=$1
infile=$2
for ((i=0; i<num_procs; i++)); do
  (
    while read -r line; do
      echo "Thread $i: processing $line"
    done < <(awk -v num_procs="$num_procs" -v i="$i" \
                 'NR % num_procs == i { print }' <"$infile")
  ) &
done
wait # wait for all the $num_procs subprocesses to finish
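
A hypothetical invocation, assuming the snippet is saved as stripe.sh and the input file has one work item per line:

printf '%s\n' task-a task-b task-c task-d > input.txt
bash stripe.sh 2 input.txt   # 2 subprocesses, each handling alternating lines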
answered Oct 13 '22 by Charles Duffy