Consider a shared-memory system running Linux with 4 Intel Xeon E5 CPUs, each with 10 cores. PBS Pro is installed. Users run, for example, qsub -l select=1:ncpus=30 when they want a program to run on 30 cores, or do setenv OMP_NUM_THREADS 30 for other software.
My question mainly concerns commercial software packages that are built around MPI. Disregarding PBS and qsub for a moment, all you do to run these programs is either choose the number of cores from a drop-down menu after the program starts, or pass it at the prompt when launching, with something like ./cfd.exe -np 30 to use 30 cores.
The system has 4 physical sockets = 4 CPUs;
each CPU has 10 cores = 40 total physical cores;
each core has hyperthreading, so cat /proc/cpuinfo reports 80 CPUs (or cores), numbered 0 to 79.
q1: I am confused about when and how hyperthreading takes place: does it happen automatically behind the scenes, or do I have to invoke it manually somehow?
The question applies to any system with many cores, but I will keep using the numbers above for simplicity. When PBS Pro and qsub are used and a user does qsub -l select=1:ncpus=20 they get allocated 10 physical cores, numbered say 10..19, and also 10 virtual cores numbered 50..59. This brings me to question 2 below.
q2: What is the correct way to run?
If /proc/cpuinfo reports a total of 80 CPUs, am I safe to assume I can always do ./cfd.exe -np 80 or setenv OMP_NUM_THREADS 80 and be sure that no core is running at only 50%? Or must I never go above -np 40 and let the system handle it?
I use CFD software as an example, but I am also asking this with respect to software that my coworkers and I have written using OpenMP and other parallel directives.
q3: Suppose I launch a program and tell it to run on 4 cores, or it is hard-coded to use at most 4 cores in parallel. If the CPU is hyperthreading capable, am I correct in thinking that hyperthreading would happen automatically behind the scenes, so that disabling hyperthreading at the BIOS or EFI level would make my program run slower? Assume that the program and problem scale linearly: 8 cores should always be twice as fast as 4, 16 cores always twice as fast as 8, and so on. This question #3 is the one I am most interested in understanding correctly.
Hyper-Threading (HT) means that two processors* share a physical core.
*: I use the term processor in the Linux sense. With Hyper-Threading active, a processor is equal to one hardware thread.
You do not explicitly use HT in an application. Whether HT is used depends on whether the application's threads are running on processors that share a physical core.
How the batch system handles this depends on its configuration. In my experience HT is usually disabled on shared batch systems because it complicates things, leads to subtle performance issues, and rarely provides significant performance benefits for optimized codes. There is some interesting documentation about how to deal with HT in PBS.
I would suggest you try and verify what kind of processors you get from the batch system by running the following job:
bash -c "taskset -p \$\$"
Note that the escaped \$\$ expands to the process ID of the inner bash, not that of the shell invoking the job submission.
The resulting hexadecimal affinity mask tells you which processors the job is running on. For example, 5 = 00000101 in binary would mean processors 0 and 2.
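For larger masks, decoding by hand gets tedious. Here is a minimal Python sketch that turns the mask into a list of processor numbers. The function names mask_to_cpus and parse_taskset_line are my own, and the line format assumed is the usual "pid N's current affinity mask: ..." output of taskset -p.

```python
import re


def mask_to_cpus(mask_hex):
    """Return the processor numbers whose bits are set in a hex affinity mask."""
    # Some tools group long masks with commas; drop them before parsing.
    mask = int(mask_hex.replace(",", ""), 16)
    cpus = []
    cpu = 0
    while mask:
        if mask & 1:          # lowest bit set: this processor is allowed
            cpus.append(cpu)
        mask >>= 1
        cpu += 1
    return cpus


def parse_taskset_line(line):
    """Extract and decode the mask from one line of `taskset -p` output."""
    match = re.search(r"affinity mask:\s*([0-9a-fA-F,]+)", line)
    if match is None:
        raise ValueError(f"no affinity mask found in: {line!r}")
    return mask_to_cpus(match.group(1))


if __name__ == "__main__":
    print(mask_to_cpus("5"))  # [0, 2]
```

Python integers are arbitrary precision, so this also works for an 80-processor mask, which would not fit in a 64-bit integer.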
I think you misunderstand HT. It does not provide you a 2x speedup just because you have twice as many processors available. You might get 10% speedup, or your application might run slower. If your goal is performance, you will always prefer to use 4 processors that have separate cores rather than 4 processors that share 2 cores.
Whether an application benefits from HT depends heavily on the application. If you want to utilize HT, just run the maximum number of processes / threads so that all processors (or hardware threads) are used.
If your application does not benefit from HT, set the number of processes / threads to the number of physical cores.
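To find the number of physical cores, as opposed to the number of processors Linux reports, one option is to count distinct (physical id, core id) pairs in /proc/cpuinfo. The following is only a sketch: it assumes the x86 layout of /proc/cpuinfo where those two fields are present (other architectures may not provide them), and the function name and SAMPLE text are made up for illustration.

```python
def count_processors_and_cores(cpuinfo_text):
    """Return (logical processors, physical cores) from /proc/cpuinfo text."""
    logical = 0
    cores = set()
    physical_id = core_id = None
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "processor":
            # A new processor block starts; forget the previous block's ids.
            logical += 1
            physical_id = core_id = None
        elif key == "physical id":
            physical_id = value
        elif key == "core id":
            core_id = value
        if physical_id is not None and core_id is not None:
            # Hardware threads of one core share the same (socket, core) pair.
            cores.add((physical_id, core_id))
    return logical, len(cores)


# Hypothetical excerpt: one socket, two cores, two hardware threads per core.
SAMPLE = """\
processor   : 0
physical id : 0
core id     : 0

processor   : 1
physical id : 0
core id     : 1

processor   : 2
physical id : 0
core id     : 0

processor   : 3
physical id : 0
core id     : 1
"""

if __name__ == "__main__":
    print(count_processors_and_cores(SAMPLE))  # (4, 2)
    try:
        with open("/proc/cpuinfo") as f:
            print(count_processors_and_cores(f.read()))
    except OSError:
        pass  # not on Linux, or /proc is unavailable
```

On the machine described in the question, this should report 80 processors and 40 cores if HT is enabled, so 40 would be the process / thread count to use when HT does not help.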
You can then help the scheduler by making sure your application threads are only allowed to use one hardware thread per physical core, e.g. via PBS, taskset or KMP_AFFINITY.
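As a rough illustration of the "one hardware thread per physical core" idea, here is a Python sketch that takes the processors the job is allowed to use, keeps one per core, and restricts the current process to those. It relies on the Linux-only os.sched_getaffinity / os.sched_setaffinity calls and on the sysfs file thread_siblings_list. The helper names are my own, and this is not how PBS, taskset or KMP_AFFINITY do it internally.

```python
import os


def parse_cpu_list(text):
    """Parse a kernel CPU list such as "10,50" or "0-3,8" into a sorted list."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        first, _, last = part.partition("-")
        cpus.update(range(int(first), int(last or first) + 1))
    return sorted(cpus)


def one_thread_per_core(allowed, siblings_of):
    """Pick the lowest-numbered allowed processor from each physical core."""
    chosen = []
    seen_cores = set()
    for cpu in sorted(allowed):
        core = frozenset(siblings_of(cpu))  # all hardware threads of this core
        if core not in seen_cores:
            seen_cores.add(core)
            chosen.append(cpu)
    return chosen


def sysfs_siblings(cpu):
    """Read the hardware threads sharing a core with `cpu` from sysfs."""
    path = f"/sys/devices/system/cpu/cpu{cpu}/topology/thread_siblings_list"
    with open(path) as f:
        return parse_cpu_list(f.read())


if __name__ == "__main__":
    allowed = os.sched_getaffinity(0)   # processors granted to this process
    pinned = one_thread_per_core(allowed, sysfs_siblings)
    os.sched_setaffinity(0, pinned)     # child processes inherit this mask
    print("pinned to", pinned)
```

With the allocation from the question (processors 10..19 and 50..59), and assuming processor n and processor n+40 are the two hardware threads of one core, one_thread_per_core would keep 10..19. Since the affinity mask is inherited by child processes, a small launcher like this could set it before starting the actual solver.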