Ensure hybrid MPI / OpenMP runs each OpenMP thread on a different core

Tags: mpi, hpc, openmp, mpich

I am trying to get a hybrid OpenMP / MPI job to run so that OpenMP threads are separated by core (only one thread per core). I have seen other answers which use numactl and bash scripts to set environment variables, and I don't want to do this.

I would like to be able to do this only by setting OMP_NUM_THREADS and/or OMP_PROC_BIND and mpiexec options on the command line. I have tried the following. Let's say I want 2 MPI processes that each have 2 OpenMP threads, with each thread running on a separate core, so I want 4 cores in total.

OMP_PROC_BIND=true OMP_PLACES=cores OMP_NUM_THREADS=2 mpiexec -n 2

This splits the job so that only two processes are doing work, and they are all on the same CPU, so each of them uses only about 25% of the CPU. If I try:

OMP_PROC_BIND=false OMP_PLACES=cores OMP_NUM_THREADS=2 mpiexec -n 2

then I just get two separate MPI processes, each running at 100% or slightly over 100% CPU according to top. This doesn't seem to show different cores being used for the OpenMP threads.

How do I force the system to put separate threads on separate cores?

FYI, lscpu prints this:

-CPU(s):                48
-On-line CPU(s) list:   0-47
-Thread(s) per core:    2
-Core(s) per socket:    12
-Socket(s):             2
-NUMA node(s):          2


asked Dec 14 '17 20:12

v2v1

1 Answer

Actually, I'd expect your first example to work. Setting OMP_PROC_BIND=true is important here, so that OpenMP stays within the CPU binding inherited from the MPI process when pinning its threads.

Depending on the batch system and MPI implementation, the way to set these things up can differ considerably.

Hyperthreading, or more generally multiple hardware threads per core, might also be part of the problem: every hardware thread shows up as a "core" in Linux, and you'll never see 200% when two processes run on the two Hyperthreads of one core.

Here is a generic approach I use when figuring these things out for a given combination of MPI implementation, OpenMP implementation, and system. Cray's documentation contains a very helpful program for checking this quickly. It's called xthi.c; search for the filename to find the source (I'm not sure whether it's legal to paste it here). Compile with:

mpicc xthi.c -fopenmp -o xthi

Now we can see exactly what is going on. For instance, on a dual-socket 8-core Xeon with Hyperthreading and Intel MPI (MPICH-based), we get:

$ OMP_PROC_BIND=true OMP_PLACES=cores OMP_NUM_THREADS=2 mpiexec -n 2 ./xthi

Hello from rank 0, thread 0, on localhost. (core affinity = 0,16)
Hello from rank 0, thread 1, on localhost. (core affinity = 1,17)
Hello from rank 1, thread 0, on localhost. (core affinity = 8,24)
Hello from rank 1, thread 1, on localhost. (core affinity = 9,25)

As you can see, "cores" means each place comprises all the Hyperthreads of a core. Note that mpiexec also pins the two ranks to different sockets by default. With OMP_PLACES=threads, each OpenMP thread is pinned to a single hardware thread, and you get one thread per core:

$ OMP_PROC_BIND=true OMP_PLACES=threads OMP_NUM_THREADS=2 mpiexec -n 2 ./xthi
Hello from rank 0, thread 0, on localhost. (core affinity = 0)
Hello from rank 0, thread 1, on localhost. (core affinity = 1)
Hello from rank 1, thread 0, on localhost. (core affinity = 8)
Hello from rank 1, thread 1, on localhost. (core affinity = 9)

With OMP_PROC_BIND=false (your second example), I get:

$ OMP_PROC_BIND=false OMP_PLACES=cores OMP_NUM_THREADS=2 mpiexec -n 2 ./xthi
Hello from rank 0, thread 0, on localhost. (core affinity = 0-7,16-23)
Hello from rank 0, thread 1, on localhost. (core affinity = 0-7,16-23)
Hello from rank 1, thread 0, on localhost. (core affinity = 8-15,24-31)
Hello from rank 1, thread 1, on localhost. (core affinity = 8-15,24-31)

Here, each OpenMP thread is allowed to run anywhere on a full socket, so the MPI ranks still operate on distinct resources. However, the OS is free to move the OpenMP threads of one process around among all the cores of that socket. On my test system, this is the same as just setting OMP_NUM_THREADS=2.

Again, this might depend on specific OpenMP and MPI implementations and versions, but I think you'll easily figure out what's going on with the description above.

Hope that helps.

answered Oct 02 '22 22:10

noma
