I have a dual-socket Xeon E5522 2.26 GHz machine (with hyperthreading disabled) running Ubuntu Server on Linux kernel 3.0 with NUMA support. The layout is 4 physical cores per socket. An OpenMP application runs on this machine and I have the following questions:
Does an OpenMP program automatically take advantage of NUMA (i.e. is a thread and its private data kept on the same NUMA node throughout the execution) when running on a NUMA machine with a NUMA-aware kernel? If not, what can be done?
What about NUMA and per-thread private C++ STL data structures?
NUMA (non-uniform memory access) links several small, cost-effective nodes through a high-performance interconnect. Each node contains processors and memory, much like a small SMP system, but an advanced memory controller lets a node use the memory of all other nodes, creating a single system image. A socket is the physical location where a processor package plugs into the motherboard and can contain one or more NUMA nodes; each NUMA node consists of a set of cores (each running one or more hardware threads) and the memory that is local to that node. The goal is to improve performance and let the system grow as processing needs evolve.
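On Linux, the numactl tool is the usual way to inspect and control this layout. As an illustrative sketch (the node number and the program name ./app.x are just examples, not from the question), you can list the topology and pin a run to a single node:

shell$ numactl --hardware
shell$ numactl --cpunodebind=0 --membind=0 ./app.x

The first command shows the available nodes with their CPUs and memory sizes; the second restricts both the threads and the memory allocations of ./app.x to node 0.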
The current OpenMP standard defines a boolean environment variable OMP_PROC_BIND that controls the binding of OpenMP threads. If it is set to true, e.g.

shell$ OMP_PROC_BIND=true OMP_NUM_THREADS=12 ./app.x

then the OpenMP execution environment should not move threads between processors. Unfortunately, nothing more is said about how those threads should be bound, and that is exactly what a special working group in the OpenMP language committee is addressing right now. OpenMP 4.0 will come with new environment variables and clauses that allow one to specify how to distribute the threads. Of course, many OpenMP implementations offer their own non-standard methods to control binding.
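For a rough idea of what those OpenMP 4.0 controls look like, here is a sketch (not from the original answer; it assumes a runtime that already implements them, and the place list and thread count are just example values for this machine):

shell$ OMP_PLACES=cores OMP_PROC_BIND=spread OMP_NUM_THREADS=8 ./app.x

or, per parallel region in the source:

#pragma omp parallel proc_bind(spread)
{
    /* threads are spread as evenly as possible over the place list,
       here one place per physical core, i.e. 4 threads per socket */
}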
Still, most OpenMP runtimes are not NUMA aware. They will happily dispatch threads to any available CPU, and you would have to make sure that each thread only accesses data that belongs to it. There are some general hints in this direction:
- Avoid dynamic scheduling for parallel for (C/C++) / DO (Fortran) loops; prefer static scheduling.
- Initialise data in the same thread that will later use it: for loops with the same team size and the same number of iteration chunks, with static scheduling chunk 0 of both loops will be executed by thread 0, chunk 1 by thread 1, and so on (see the sketch after this list).
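A minimal sketch of that first-touch idea (my example, not part of the original answer; the array name, size and values are arbitrary): the data is initialised and later processed with the same static schedule, so each chunk is first touched, and therefore placed, on the NUMA node of the thread that keeps using it.

#include <stdlib.h>

int main(void)
{
    const long n = 1L << 24;
    double *a = malloc(n * sizeof *a);   /* pages are not placed yet */

    /* First touch: thread i initialises chunk i, so its pages end up on
       the NUMA node that thread i runs on (assuming threads are bound). */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++)
        a[i] = 0.0;

    /* Same team size and the same static schedule: thread i gets chunk i
       again and therefore works on memory local to its NUMA node. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++)
        a[i] = 2.0 * i;

    free(a);
    return 0;
}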
Some colleagues of mine have thoroughly evaluated the NUMA behaviour of different OpenMP runtimes and have specifically looked into the NUMA awareness of Intel's implementation, but the articles are not published yet, so I cannot provide you with a link.
There is one research project, called ForestGOMP, which aims at providing a NUMA-aware drop-in replacement for libgomp. Maybe you should give it a look.