
memory allocation and access on NUMA hardware

I am developing a scientific computing tool in Python that should be capable of distributing work over multiple cores in a NUMA shared-memory environment. I am looking into the most efficient way of doing this.

Threads are, unfortunately, out of the game because of Python's global interpreter lock, which leaves forking as my only option. For inter-process communication I suppose my options are pipes, sockets, or mmap. Please point out anything missing from this list.

My application will require quite a lot of communication between processes and access to a fair amount of common data. My main concern is latency.

My questions:

- When I fork a process, will its memory be located near the core it is assigned to? Since fork on *nix copies on write, I suppose this cannot be the case initially. Do I want to force a copy for faster memory access, and if so, what is the best way to do that?
- If I use mmap for communication, can that memory still be distributed over the cores, or will it be located at a single one?
- Is there a process that transparently relocates data to optimize access?
- Is there a way to have direct control over physical allocation, or a way to request information about allocation to aid optimization?

On a higher level, which of these things are determined by my hardware and which by the operating system? I am in the process of buying a high-end multi-socket machine and am deciding between AMD Opteron and Intel Xeon. What are the implications of the specific hardware on any of the questions above?

asked Oct 16 '11 by gertjan



1 Answer

Since the GIL is one of Python's Achilles' heels, it has fairly good multiprocessing support instead. The multiprocessing module provides Queues, Pipes, Locks, shared values, and shared arrays. There is also something called a Manager, which lets you wrap a lot of Python data structures and share them in an IPC-friendly way. I imagine that most of these work via pipes or sockets, but I haven't delved too deeply into the internals.

http://docs.python.org/2/library/multiprocessing.html
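
As a rough illustration (my own sketch, not from the docs above): a shared Array that forked workers read and write without copying the data, plus a Queue for sending results back. The slicing scheme and the worker function are made up for the example.

```python
from multiprocessing import Process, Queue, Array

def worker(shared, start, stop, results):
    # Work on one slice of the shared array; only the handle to the
    # shared memory block is passed around, not the data itself.
    total = 0.0
    for i in range(start, stop):
        shared[i] *= 2.0
        total += shared[i]
    results.put((start, stop, total))

if __name__ == '__main__':
    data = Array('d', range(1000))       # 1000 doubles in shared memory
    results = Queue()
    chunk = len(data) // 4
    procs = [Process(target=worker, args=(data, i * chunk, (i + 1) * chunk, results))
             for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    while not results.empty():
        print(results.get())
```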

How does Linux model NUMA systems?

The kernel detects that it's running on a multi-core machine, then detects how much hardware there is and what the topology is. It then creates a model of this topology using the idea of Nodes. A Node is a physical socket that contains a CPU (possibly with multiple cores) and the memory attached to it. Why Node-based instead of core-based? Because a memory bus is the set of physical wires that connect RAM to a CPU socket, and all cores on a CPU in a single socket have the same access time to all the RAM that resides on that memory bus.
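
You can see the kernel's Node model directly, either with `numactl --hardware` or in sysfs. A small sketch (Linux-specific paths, not part of the original answer):

```python
import glob, os

# List each NUMA node the kernel has modelled, with its CPUs and local RAM.
# These sysfs paths exist on NUMA-aware Linux kernels.
for node in sorted(glob.glob('/sys/devices/system/node/node[0-9]*')):
    cpus = open(os.path.join(node, 'cpulist')).read().strip()
    with open(os.path.join(node, 'meminfo')) as f:
        mem_kb = next(line for line in f if 'MemTotal' in line).split()[-2]
    print('%s: cpus %s, %s kB local RAM' % (os.path.basename(node), cpus, mem_kb))
```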

How is memory on one memory bus accessed by a core on another memory bus?

On x86 systems this happens through the caches. The CPU uses a piece of hardware called the Translation Lookaside Buffer (TLB) to map virtual addresses to physical addresses. If the memory a cache miss needs is local, it's read locally. If it's not local, the request goes over the HyperTransport links on AMD systems, or QuickPath Interconnect on Intel systems, to the remote node. Since it's done at the cache level you theoretically don't need to know about it, and you certainly don't have any direct control over it. But for high-performance applications it's incredibly useful to understand, so you can minimize the number of remote accesses.
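
One practical way to keep remote accesses down (my own sketch, not part of the answer): pin each worker to the cores of one node, so the pages it touches first are allocated from that node's local memory under Linux's default local-allocation policy and later accesses stay on the local bus. The core-to-node mapping below is an assumption; read the real one from sysfs as shown above.

```python
import os
from multiprocessing import Process

NODE0_CPUS = {0, 1, 2, 3}   # assumed: cores 0-3 sit on node 0; verify on your machine

def worker():
    # Restrict this process to node 0's cores (Python 3.3+, Linux only).
    os.sched_setaffinity(0, NODE0_CPUS)
    # First touch happens here, so these pages land on node 0
    # under the default local-allocation policy.
    data = bytearray(100 * 1024 * 1024)
    # ... memory-bound work on `data` ...

if __name__ == '__main__':
    p = Process(target=worker)
    p.start()
    p.join()
```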

Where does the OS actually locate the physical pages of virtual memory?

When a process is forked it inherits all of its parent's pages (due to copy-on-write). The kernel also has an idea of which node is "best" for the process, its "preferred" node. This can be modified, but again defaults to the same as the parent's. Memory allocation will default to the same node as the parent unless it's explicitly changed.
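
If you want to check where a process's pages actually ended up, NUMA-enabled Linux kernels expose per-mapping, per-node page counts in /proc/<pid>/numa_maps. A small sketch that sums them per node:

```python
def node_page_counts(pid='self'):
    """Return {node: resident pages} parsed from /proc/<pid>/numa_maps."""
    counts = {}
    with open('/proc/%s/numa_maps' % pid) as f:
        for line in f:
            for field in line.split():
                # Per-node counts look like "N0=1024", "N1=12", ...
                node, sep, pages = field.partition('=')
                if sep and node.startswith('N') and node[1:].isdigit():
                    n = int(node[1:])
                    counts[n] = counts.get(n, 0) + int(pages)
    return counts

if __name__ == '__main__':
    print(node_page_counts())   # e.g. {0: 3205, 1: 12}
```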

Is there a transparent process that moves memory?

No. Once memory is allocated it's fixed to the node it was allocated on. You can make a new allocation on another node, move the data, and deallocate on the first node, but it's a bit of a chore.

Is there a way to have control over allocation?

The default is to allocate to the local node. If you use libnuma you can change how the allocation is done (say round-robin or interleaved) instead of defaulting to local.
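
From Python the usual routes to libnuma are the numactl command-line tool (e.g. `numactl --interleave=all python script.py`) or calling the C library through ctypes. A rough ctypes sketch, assuming libnuma is installed; the function names come from the numa(3) C API and error handling is omitted:

```python
import ctypes, ctypes.util

libnuma = ctypes.CDLL(ctypes.util.find_library('numa'))
if libnuma.numa_available() < 0:
    raise RuntimeError('NUMA policy API not available on this system')

libnuma.numa_set_preferred(1)          # prefer node 1 for future allocations
# libnuma.numa_set_localalloc()        # or: back to the default local policy

# Explicitly place a 64 MiB buffer on node 0.
libnuma.numa_alloc_onnode.restype = ctypes.c_void_p
nbytes = 64 * 1024 * 1024
buf = libnuma.numa_alloc_onnode(ctypes.c_size_t(nbytes), 0)
# ... wrap `buf` (e.g. (ctypes.c_char * nbytes).from_address(buf)) and use it ...
libnuma.numa_free(ctypes.c_void_p(buf), ctypes.c_size_t(nbytes))
```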

I have taken a lot of information from this blog post:

http://blog.jcole.us/2010/09/28/mysql-swap-insanity-and-the-numa-architecture/

I would definitely recommend that you read it in its entirety to glean additional information.

answered Oct 28 '22 by Mike Sandford