Running a limited number of child processes in parallel in bash? [duplicate]

I have a large set of files for which some heavy processing needs to be done. This processing is single-threaded, uses a few hundred MiB of RAM (on the machine used to start the job), and takes a few minutes to run. My current use case is starting a Hadoop job on the input data, but I've had the same problem in other cases before.

In order to fully utilize the available CPU power I want to be able to run several of those tasks in parallel.

However, a very simple example shell script like this will thrash the system due to excessive load and swapping:

find . -type f | while read name; do
    some_heavy_processing_command ${name} &
done

So what I want is essentially similar to what "gmake -j4" does.

I know bash supports the "wait" command, but that only waits until all child processes have completed. In the past I've written scripts that run a 'ps' command and then grep the child processes out by name (yes, I know ... ugly).

What is the simplest/cleanest/best solution to do what I want?


Edit: Thanks to Fredrik: yes, this is indeed a duplicate of How to limit number of threads/sub-processes used in a function in bash. The "xargs --max-procs=4" approach works like a charm. (So I voted to close my own question.)
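For reference, a minimal sketch of that xargs approach (here `echo processing` is a stand-in for the real heavy command; `-print0`/`-0` keep filenames with spaces intact):

```shell
# Run at most 4 jobs at a time; xargs launches a new one as each finishes.
# Replace "echo processing" with some_heavy_processing_command.
find . -type f -print0 | xargs -0 --max-procs=4 -n 1 echo processing
```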

Asked by Niels Basjes on Jul 06 '11 at 08:07


2 Answers

I know I'm late to the party with this answer, but I thought I would post an alternative that, IMHO, makes the body of the script cleaner and simpler. (Clearly you can change the values 2 and 5 to suit your scenario.)

function max2 {
    while [ `jobs | wc -l` -ge 2 ]
    do
        sleep 5
    done
}

find . -type f | while read name; do
    max2
    some_heavy_processing_command ${name} &
done
wait
Answered by BruceH on Oct 26 '22
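A small variation on the same idea, assuming bash 4.3 or newer: `wait -n` blocks until any one background job exits, which avoids polling with sleep (`echo "processing $name"` is a placeholder for the real command):

```shell
#!/usr/bin/env bash
# Throttle background jobs with wait -n (requires bash 4.3+).
max_jobs=2
find . -type f | {
    while read -r name; do
        # jobs -rp lists the PIDs of still-running background jobs
        while [ "$(jobs -rp | wc -l)" -ge "$max_jobs" ]; do
            wait -n   # returns as soon as any one job finishes
        done
        echo "processing $name" &   # placeholder for the heavy command
    done
    wait   # wait for the remaining jobs before leaving the subshell
}
```

The braces keep the final wait in the same subshell as the loop, so it actually sees the jobs started there.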


#! /usr/bin/env bash

set -o monitor
# means: run background processes in separate process groups
trap add_next_job CHLD  # execute add_next_job when we receive a child-complete signal

todo_array=($(find . -type f)) # places output into an array

index=0
max_jobs=2

function add_next_job {
    # if there are still jobs to do then add one
    if [[ $index -lt ${#todo_array[*]} ]]
    # the hash in the if is not a comment - rather it's bash's awkward way of getting the array's length
    then
        echo adding job ${todo_array[$index]}
        do_job ${todo_array[$index]} &
        # replace the line above with the command you want
        index=$(($index+1))
    fi
}

function do_job {
    echo "starting job $1"
    sleep 2
}

# add the initial set of jobs
while [[ $index -lt $max_jobs ]]
do
    add_next_job
done

# wait for all jobs to complete
wait
echo "done"

Having said that, Fredrik makes the excellent point that xargs does exactly what you want...

Answered by Dunes on Oct 26 '22