What does it mean to configure MPI for shared memory?

I have a somewhat research-related question.

I have just finished implementing a structured skeleton framework based on MPI (specifically using openmpi 6.3). The framework is meant to be used on a single machine. Now I am comparing it with earlier skeleton implementations (such as Skandium, FastFlow, ...).

One thing I have noticed is that my implementation does not perform as well as the others. I think this is because my implementation is based on MPI (and thus on two-sided communication that requires matching send and receive operations), while the implementations I am comparing against are based on shared memory. (I still have no good explanation to back that up, and that is part of my question.)

There is a big difference in completion time between the two categories.

Today I was also introduced to configuring Open MPI for shared memory, here: openmpi-sm

And here comes my question.

First, what does it mean to configure MPI for shared memory? MPI processes live in their own virtual address spaces, so what does a flag like the one in the following command really do? (I thought that in MPI all communication happens by explicitly passing messages and that no memory is shared between processes.)

    shell$ mpirun --mca btl self,sm,tcp -np 16 ./a.out

Second, why is the performance of MPI so much worse than that of the other skeleton implementations developed for shared memory? After all, I am also running it on a single multi-core machine. (I suppose it is because the other implementations use thread-based parallel programming, but I have no convincing explanation for that.)

Any suggestions or further discussion are very welcome.

Please let me know if I need to clarify my question further.

Thank you for your time!

asked Nov 21 '12 21:11

LeTex

1 Answer

Open MPI is very modular. It has its own component model called Modular Component Architecture (MCA). This is where the name of the --mca parameter comes from - it is used to provide runtime values to MCA parameters, exported by the different components in the MCA.

Whenever two processes in a given communicator want to talk to each other, MCA finds suitable components that are able to transmit messages from one process to the other. If both processes reside on the same node, Open MPI usually picks the shared memory BTL component, known as sm. If the processes reside on different nodes, Open MPI walks the available network interfaces and chooses the fastest one that can connect to the other node. It gives preference to fast networks like InfiniBand (via the openib BTL component), but if your cluster doesn't have InfiniBand, TCP/IP is used as a fallback, provided the tcp BTL component is in the list of allowed BTLs.

By default you do not need to do anything special in order to enable shared memory communication. Just launch your program with mpiexec -np 16 ./a.out. What you have linked to is the shared memory part of the Open MPI FAQ, which gives hints on which parameters of the sm BTL could be tweaked in order to get better performance. My experience with Open MPI shows that the default parameters are nearly optimal and work very well, even on exotic hardware like multilevel NUMA systems.

Note that the default shared memory communication implementation copies the data twice: once from the send buffer to shared memory and once from shared memory to the receive buffer. A shortcut exists in the form of the KNEM kernel device, but you have to download and compile it separately, as it is not part of the standard Linux kernel. With KNEM support, Open MPI is able to perform "zero-copy" transfers between processes on the same node: the copy is done by the kernel device, and it is a direct copy from the memory of the first process to the memory of the second process. This dramatically improves the transfer of large messages between processes that reside on the same node.
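To make the two-copy path concrete, here is a minimal Python sketch of the general copy-in/copy-out idea. It is not how the sm BTL is actually implemented; it only mimics the data movement described above. The helper name `copy_in_copy_out` is made up for illustration, an anonymous `mmap` stands in for the shared segment, a pipe stands in for the notification mechanism, and `os.fork` makes it Unix-only.

```python
import mmap
import os


def copy_in_copy_out(payload: bytes) -> bytes:
    """Pass `payload` from a child (sender) to the parent (receiver).

    Illustrative helper only, not an Open MPI API.
    """
    if not payload:
        raise ValueError("payload must be non-empty")
    # Anonymous mapping; on Unix it defaults to MAP_SHARED, so after fork()
    # both processes see the same physical pages.
    segment = mmap.mmap(-1, len(payload))
    ready_r, ready_w = os.pipe()  # tells the receiver the data has arrived

    pid = os.fork()
    if pid == 0:
        # Sender: `payload` lives in this process's private memory.
        status = 1
        try:
            segment[:] = payload     # copy 1: send buffer -> shared segment
            os.write(ready_w, b"1")  # notify the receiver
            status = 0
        finally:
            os._exit(status)

    # Receiver.
    os.close(ready_w)                # so read() sees EOF if the sender dies
    signalled = os.read(ready_r, 1)  # block until the sender has written
    os.close(ready_r)
    os.waitpid(pid, 0)
    if signalled != b"1":
        segment.close()
        raise RuntimeError("sender exited without delivering the message")
    received = segment[:]            # copy 2: shared segment -> private bytes
    segment.close()
    return received


if __name__ == "__main__":
    print(copy_in_copy_out(b"hello from the sender"))
```

Each process still has its own virtual address space; only the small segment is mapped into both. This is also why the message is copied twice, and why a kernel-assisted mechanism such as KNEM can save one of the copies.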

Another option is to completely forget about MPI and use shared memory directly. You can use the POSIX memory management interface (see here) to create a shared memory block and have all processes operate on it directly. If the data is stored in the shared memory, this could be beneficial, as no copies would be made. But watch out for NUMA issues on modern multi-socket systems, where each socket has its own memory controller and accessing memory attached to a remote socket on the same board is slower. Process pinning/binding is also important: pass --bind-to-socket to mpiexec to bind each MPI process to the cores of a single socket, or --bind-to-core to pin each process to a separate CPU core.
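As a rough sketch of what "operate on it directly" means, the Python example below lets several forked workers fill disjoint slots of one shared block in place, after which the parent reads the results without any message being sent. The function name `shared_squares` and the squares workload are hypothetical. An anonymous `mmap` is used here for brevity; with unrelated processes in C you would typically obtain the block with `shm_open` followed by `mmap`. The sketch is Unix-only because of `os.fork`.

```python
import mmap
import os
import struct


def shared_squares(n_workers: int = 4, n_items: int = 16) -> list:
    """Hypothetical example: workers write i*i into a shared block in place."""
    slot = struct.Struct("q")  # one signed 64-bit integer per slot
    # Anonymous mapping; MAP_SHARED by default on Unix, so forked children
    # write to the same physical pages the parent later reads.
    block = mmap.mmap(-1, n_items * slot.size)

    pids = []
    for rank in range(n_workers):
        pid = os.fork()
        if pid == 0:
            # Worker: handles items rank, rank + n_workers, ... so that no
            # two workers ever touch the same slot (no locking needed).
            status = 1
            try:
                for i in range(rank, n_items, n_workers):
                    slot.pack_into(block, i * slot.size, i * i)
                status = 0
            finally:
                os._exit(status)
        pids.append(pid)

    # Parent: wait for every worker, then read the results in place.
    failed = False
    for pid in pids:
        _, status = os.waitpid(pid, 0)
        failed = failed or status != 0
    if failed:
        block.close()
        raise RuntimeError("a worker process failed")
    result = [slot.unpack_from(block, i * slot.size)[0] for i in range(n_items)]
    block.close()
    return result


if __name__ == "__main__":
    print(shared_squares())
```

Because every worker owns a disjoint set of slots, no synchronisation beyond the final wait is needed. As soon as processes update the same locations, you would need semaphores or atomics, which is the bookkeeping that MPI's message matching otherwise does for you.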

answered Sep 22 '22 04:09

Hristo Iliev

Tags: parallel-processing, mpi, openmpi, shared-memory, message-passing
