I have a dual-socket Xeon E5522 2.26 GHz machine (with hyperthreading disabled) running Ubuntu Server on Linux kernel 3.0 with NUMA support. The layout is 4 physical cores per socket. An OpenMP application runs on this machine and I have the following questions:
Does an OpenMP program automatically take advantage of NUMA (i.e. is a thread and its private data kept on the same NUMA node throughout the execution) when running on a NUMA machine with a NUMA-aware kernel? If not, what can be done?
What about NUMA and per-thread private C++ STL data structures?
NUMA (non-uniform memory access) links several small, cost-effective nodes through a high-performance interconnect. Each node contains processors and memory, much like a small SMP system, but an advanced memory controller lets a node use the memory of all other nodes, creating a single system image. A socket is the physical location where a processor package plugs into the motherboard and can contain one or more NUMA nodes; each NUMA node consists of a set of cores (each running one or more hardware threads) and the memory that is local to that node. The goal is to improve performance and let the system grow as processing needs evolve.
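On Linux, the numactl tool is the usual way to inspect and control this layout. As an illustrative sketch (the node number and the program name ./app.x are just examples, not from the question), you can list the topology and pin a run to a single node:

shell$ numactl --hardware
shell$ numactl --cpunodebind=0 --membind=0 ./app.x

The first command shows the available nodes with their CPUs and memory sizes; the second restricts both the threads and the memory allocations of ./app.x to node 0.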
The current OpenMP standard defines a boolean environment variable OMP_PROC_BIND that controls the binding of OpenMP threads. If it is set to true, e.g.

shell$ OMP_PROC_BIND=true OMP_NUM_THREADS=12 ./app.x

then the OpenMP execution environment should not move threads between processors. Unfortunately, nothing more is said about how those threads should be bound, and that is exactly what a special working group in the OpenMP language committee is addressing right now. OpenMP 4.0 will come with new environment variables and clauses that allow one to specify how to distribute the threads. Of course, many OpenMP implementations offer their own non-standard methods to control binding.
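For a rough idea of what those OpenMP 4.0 controls look like, here is a sketch (not from the original answer; it assumes a runtime that already implements them, and the place list and thread count are just example values for this machine):

shell$ OMP_PLACES=cores OMP_PROC_BIND=spread OMP_NUM_THREADS=8 ./app.x

or, per parallel region in the source:

#pragma omp parallel proc_bind(spread)
{
    /* threads are spread as evenly as possible over the place list,
       here one place per physical core, i.e. 4 threads per socket */
}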
Still, most OpenMP runtimes are not NUMA aware. They will happily dispatch threads to any available CPU, and you would have to make sure that each thread only accesses data that belongs to it. There are some general hints in this direction:
- Avoid dynamic scheduling for parallel for (C/C++) / DO (Fortran) loops; prefer static scheduling.
- Initialise data in the same thread that will later use it: for loops with the same team size and the same number of iteration chunks, with static scheduling chunk 0 of both loops will be executed by thread 0, chunk 1 by thread 1, and so on (see the sketch after this list).
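A minimal sketch of that first-touch idea (my example, not part of the original answer; the array name, size and values are arbitrary): the data is initialised and later processed with the same static schedule, so each chunk is first touched, and therefore placed, on the NUMA node of the thread that keeps using it.

#include <stdlib.h>

int main(void)
{
    const long n = 1L << 24;
    double *a = malloc(n * sizeof *a);   /* pages are not placed yet */

    /* First touch: thread i initialises chunk i, so its pages end up on
       the NUMA node that thread i runs on (assuming threads are bound). */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++)
        a[i] = 0.0;

    /* Same team size and the same static schedule: thread i gets chunk i
       again and therefore works on memory local to its NUMA node. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++)
        a[i] = 2.0 * i;

    free(a);
    return 0;
}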
Some colleagues of mine have thoroughly evaluated the NUMA behaviour of different OpenMP runtimes and have specifically looked into the NUMA awareness of Intel's implementation, but the articles are not published yet, so I cannot provide you with a link.
There is one research project, called ForestGOMP, which aims at providing a NUMA-aware drop-in replacement for libgomp. Maybe you should give it a look.