I have a large set of files for which some heavy processing needs to be done. This processing is single-threaded, uses a few hundred MiB of RAM (on the machine used to start the job) and takes a few minutes to run. My current use case is to start a Hadoop job on the input data, but I've had this same problem in other cases before.
In order to fully utilize the available CPU power, I want to be able to run several of those tasks in parallel.
However, a very simple example shell script like this will thrash the system performance due to excessive load and swapping:
find . -type f | while read -r name ; do some_heavy_processing_command "${name}" & done
So what I want is essentially similar to what "gmake -j4" does.
I know bash supports the "wait" command, but that only waits until all child processes have completed. In the past I've created scripts that run a 'ps' command and then grep the child processes out by name (yes, I know ... ugly).
What is the simplest/cleanest/best solution to do what I want?
Edit: Thanks to Fredrik: yes, indeed this is a duplicate of "How to limit number of threads/sub-processes used in a function in bash". The "xargs --max-procs=4" works like a charm. (So I voted to close my own question.)
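For reference, a minimal sketch of that xargs approach (using echo as a stand-in for the real processing command):

```shell
# Run up to 4 jobs concurrently; -print0 / -0 keep filenames with
# spaces intact, and -n 1 passes one file per invocation.
# `echo` stands in here for the real processing command.
find . -type f -print0 | xargs -0 --max-procs=4 -n 1 echo
```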
Luckily, there are multiple powerful command-line tools for parallelization on Linux that can help us achieve this: the built-in Bash ampersand (&) operator, xargs, and GNU parallel.
There are many common tasks in Linux that we may want to run in parallel, such as:
1. Downloading a large number of files
2. Encoding/decoding a large number of images on a machine with multiple CPU cores
3. Running a computation with many different parameters and storing the results
Normally, if you run a command in your Bash terminal, you have to wait for it to complete before running the next command. For example, if you run "sleep 60", the shell waits 60 seconds for the sleep to complete before you can enter anything else. As a basic way to run commands in parallel, we can use the & operator to run a command asynchronously, so that the shell doesn't wait for the current command to complete before moving on to the next one. This creates processes that start at essentially the same instant and run in parallel.
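As a quick illustration of the & operator (a sketch; timings are approximate):

```shell
# Both sleeps run in the background simultaneously, so the pair
# finishes in about 2 seconds rather than 4.
sleep 2 &
sleep 2 &
wait   # block until all background jobs have finished
```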
I know I'm late to the party with this answer, but I thought I would post an alternative that, IMHO, makes the body of the script cleaner and simpler. (Clearly you can change the values 2 and 5 to be appropriate for your scenario.)
function max2 {
    while [ "$(jobs | wc -l)" -ge 2 ]
    do
        sleep 5
    done
}

find . -type f | while read -r name ; do
    max2
    some_heavy_processing_command "${name}" &
done
wait
#!/usr/bin/env bash
set -o monitor  # means: run background processes in separate process groups
trap add_next_job CHLD  # execute add_next_job when we receive a child-complete signal

todo_array=($(find . -type f))  # places output into an array
index=0
max_jobs=2

function add_next_job {
    # if there are still jobs to do, add one
    if [[ $index -lt ${#todo_array[*]} ]]
    # the hash above is not a comment - it's bash's (awkward) way of getting the array length
    then
        echo adding job ${todo_array[$index]}
        do_job ${todo_array[$index]} &  # replace this line with the command you want
        index=$(($index+1))
    fi
}

function do_job {
    echo "starting job $1"
    sleep 2
}

# add initial set of jobs
while [[ $index -lt $max_jobs ]]
do
    add_next_job
done

# wait for all jobs to complete
wait
echo "done"
Having said that, Fredrik makes the excellent point that xargs does exactly what you want...
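GNU parallel, mentioned earlier, works in much the same way as xargs, assuming it is installed; here echo again stands in for the real processing command:

```shell
# -0 reads NUL-delimited input (matching find's -print0);
# -j4 caps concurrency at 4 jobs at a time.
find . -type f -print0 | parallel -0 -j4 echo
```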