How to control parallel tasks in Linux to avoid too much context switching

Question

Now I'm using Linux to perform the following task:

while IFS= read -r parameter
do
    # one background job per line of parameter_file, with no cap on how many run at once
    ./program_a "$parameter" "$parameter.log" 2>&1 &
done < parameter_file

Each parameter is the name of a file to process, and each file contains a different number of lines.

For example, parameter_file contains:

File_A
File_B
File_C

File_A contains 1k lines, File_B contains 10k lines, and File_C contains 1000k lines, so the script above has program_a processing 1k, 10k, and 1000k lines simultaneously. The processing time of each task is almost linear in the number of lines, and the tasks are independent of each other.

I have a 6-core CPU with 12 threads. Because the processing times vary so much, once the tasks for File_A and File_B have finished, only one core is left working on File_C. This wastes resources.

I want to split each file into 1k-line chunks and process the chunks simultaneously. But for this example that means 1011 tasks running at once (1k lines per task), which I think would cause serious context-switching overhead. Maybe I could tune the number of lines per chunk to work around this, but I don't think that is a good solution.

My idea is to cap the number of running tasks at 6, so that all cores stay busy while context switches are kept to a minimum. But I don't know how to modify my script to achieve this. Can anyone give me some advice?

Joshua Goldberg · Accepted Answer

I wouldn't try to reinvent the load-balancing wheel by splitting the files. Use GNU parallel to manage tasks of different sizes. It has plenty of options for parallel execution on one or multiple machines. If you set it up to allow, say, 4 processes in parallel, it will do that, starting a new task whenever a shorter one completes.

https://www.gnu.org/software/parallel/

https://www.gnu.org/software/parallel/parallel_tutorial.html
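
Applied to the loop in the question, a minimal sketch could look like the following. It assumes GNU parallel is installed (not the similarly named moreutils tool) and that the log is written by redirecting the program's output, as the cat example in this answer does. run_all is a hypothetical helper name, and the default of 6 jobs simply mirrors the six physical cores mentioned in the question.

```shell
#!/usr/bin/env bash
# Sketch only: run_all is a hypothetical helper, not part of GNU parallel.
# Usage: run_all PROGRAM LIST_FILE [MAX_JOBS]
run_all() {
    local program=$1 list=$2 jobs=${3:-6}
    # -j caps the number of jobs running at once.
    # :::: reads one argument per line from the list file.
    # {} is replaced by the current argument; the redirection sits inside
    # the quoted command so that it applies to each job separately.
    parallel -j "$jobs" "$program {} > {}.log 2>&1" :::: "$list"
}
```

Calling run_all ./program_a parameter_file 6 would then keep at most six jobs running, starting the next file as soon as a slot frees up.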

Here's a simple example using cat as a stand-in for ./program_a:

...write a couple of files
% cat > a
a
b
c

% cat > b
a
b
c
d

% cat > files
a
b

... run the tasks
% parallel cat {1} \> {1}.log < files

% more b.log
a
b
c
d
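
Where GNU parallel cannot be installed, the same cap can be approximated in plain bash. This is a rough sketch rather than a replacement for parallel: it assumes bash 4.3 or later (for wait -n), run_capped is a hypothetical helper name, and the log is again written by redirection, as in the cat example above.

```shell
#!/usr/bin/env bash
# Sketch only: run_capped is a hypothetical helper; requires bash 4.3+.
# Usage: run_capped PROGRAM LIST_FILE [MAX_JOBS]
run_capped() {
    local program=$1 list=$2 max_jobs=${3:-6}
    local parameter running=0
    while IFS= read -r parameter; do
        if (( running >= max_jobs )); then
            # Block until any one background job exits; its exit
            # status is deliberately ignored in this sketch.
            wait -n || true
            running=$(( running - 1 ))
        fi
        "$program" "$parameter" > "$parameter.log" 2>&1 &
        running=$(( running + 1 ))
    done < "$list"
    wait    # let the remaining jobs finish
}
```

For example, run_capped ./program_a parameter_file 6 starts a new file only after one of the six running jobs has exited. Unlike parallel, it does no output grouping or error reporting.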

Tags: linux, bash, shell, unix, multithreading

Marcus Thornton