
Optimal number of threads for GNU parallel

I think I have a fairly basic question. I just discovered the GNU parallel package and I think my workflow can really benefit from it! I am using a loop that iterates over my read files and generates the desired output. The command that is executed for each read looks something like this:

STAR --runThreadN 8 --genomeDir star_index/ --readFilesIn R1.fq R2.fq

As you can see, I specified 8 threads, which is the number of threads my virtual machine has.

My question is the following: If I use GNU parallel with a command like this:

cat reads | parallel -j 3 STAR --runThreadN 8 --genomeDir star_index/ --readFilesIn {}_R1.fq {}_R2.fq

Can my virtual machine handle the number of threads I specified, if I execute 3 jobs in parallel?

Or do I need 24 threads (3*8 threads) to properly execute this command?

I'm sorry if this is a basic question. I am very new to the field, and any help is much appreciated!

nhaus asked Sep 19 '25 16:09


1 Answer

The best advice is simply: Try different values and measure.

In parallelization there are many factors that can affect the results: disk I/O, shared CPU cache, and shared RAM bandwidth, just to name three.

top is your friend when measuring. If you can get all CPUs below 5% idle, you are unlikely to go any faster, no matter what you do.

top - 14:49:10 up 10 days,  5:48, 123 users,  load average: 2.40, 1.72, 1.67
Tasks: 751 total,   3 running, 616 sleeping,   8 stopped,   4 zombie
%Cpu(s): 17.3 us,  6.2 sy,  0.0 ni, 76.2 id,  0.3 wa,  0.0 hi,  0.0 si,  0.0 st
GiB Mem :   31.239 total,    1.441 free,   21.717 used,    8.081 buff/cache
GiB Swap:  117.233 total,  104.146 free,   13.088 used.    4.706 avail Mem 

This machine is 76.2% idle. If your processes use loads of CPU, then starting more processes in parallel here may help. If they use loads of disk I/O, it may or may not help. The only way to know is to test and measure.

top - 14:51:00 up 10 days,  5:50, 124 users,  load average: 3.41, 2.04, 1.78
Tasks: 759 total,   8 running, 619 sleeping,   8 stopped,   4 zombie
%Cpu(s): 92.8 us,  6.9 sy,  0.0 ni,  0.1 id,  0.0 wa,  0.0 hi,  0.2 si,  0.0 st
GiB Mem :   31.239 total,    1.383 free,   21.772 used,    8.083 buff/cache
GiB Swap:  117.233 total,  104.146 free,   13.087 used.    4.649 avail Mem 

This machine is 0.1% idle. Starting more processes is unlikely to make things go faster.
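If you want to log the idle figure while a run is in progress instead of watching top interactively, something along these lines may work. It is only a sketch: idle_pct is a hypothetical helper name, and it assumes the %Cpu(s) line looks like the ones shown above (procps top, decimal points, comma-separated fields).

```shell
# Hypothetical helper: print the idle percentage found on a top "%Cpu(s)" line.
# Assumes the procps layout shown above: comma-separated fields, idle tagged "id".
idle_pct() {
    awk -F',' '/^%Cpu/ {
        for (i = 1; i <= NF; i++)
            if ($i ~ / id$/) {
                sub(/ id$/, "", $i)   # drop the " id" tag
                gsub(/ /, "", $i)     # drop the padding spaces
                print $i
            }
    }'
}
```

For example, `top -b -n 2 | idle_pct | tail -n 1` would print the idle value from the second sample. Taking the second sample is an assumption on my part that the first one in batch mode may not reflect the current load.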

So increase the parallelization until idle time hits a minimum or until average processing time hits a minimum (--joblog my.log can be useful to see how long a job takes).
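A rough way to run such a test is sketched below. The function names avg_runtime and sweep, the log file names j1.log, j2.log and so on, and the idea of splitting the 8 threads evenly between jobs are all assumptions for illustration. The joblog itself is tab-separated with a header line, and JobRuntime is the fourth column.

```shell
# Mean of the JobRuntime column (4th field) of a GNU Parallel joblog.
# The first line is the header, so it is skipped.
avg_runtime() {
    LC_ALL=C awk -F'\t' 'NR > 1 { sum += $4; n++ }
        END { if (n) printf "%.3f\n", sum / n }' "$1"
}

# Hypothetical sweep: keep jobs * threads at 8 and vary the split.
# Only defined here, not run; call "sweep" once STAR and the reads file exist.
sweep() {
    for j in 1 2 4 8; do
        t=$((8 / j))
        start=$(date +%s)
        cat reads | parallel -j "$j" --joblog "j$j.log" \
            STAR --runThreadN "$t" --genomeDir star_index/ \
            --readFilesIn {}_R1.fq {}_R2.fq
        end=$(date +%s)
        echo "jobs=$j threads=$t wall=$((end - start))s avg_job=$(avg_runtime "j$j.log")s"
    done
}
```

The split with the lowest wall time is the one to keep. The average job time alone can mislead when the thread count per job changes, which is why the sketch prints both.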

And yes: GNU Parallel is likely to speed up bioinformatics (being written by a fellow bioinformatician).

Consider reading GNU Parallel 2018 (paper: http://www.lulu.com/shop/ole-tange/gnu-parallel-2018/paperback/product-23558902.html download: https://doi.org/10.5281/zenodo.1146014). Read at least chapters 1 and 2. It should take you less than 20 minutes. Your command line will love you for it.

Ole Tange answered Sep 23 '25 11:09