
memory allocation and access on NUMA hardware

I am developing a scientific computing tool in Python that should be capable of distributing work over multiple cores in a NUMA shared-memory environment. I am looking into the most efficient way of doing this.

Threads are, unfortunately, out of the game because of Python's global interpreter lock, which leaves forking as my only option. For inter-process communication I suppose my options are pipes, sockets, or mmap. Please point out anything missing from this list.

My application will require quite a lot of communication between processes and access to a fair amount of common data. My main concern is latency.

My questions:

- When I fork a process, will its memory be located near the core it is assigned to? Since fork on *nix copies on write, I suppose this cannot be the case initially. Do I want to force a copy for faster memory access, and if so, what is the best way to do that?
- If I use mmap for communication, can that memory still be distributed over the cores, or will it be located at a single one?
- Is there a process that transparently relocates data to optimize access?
- Is there a way to have direct control over physical allocation, or a way to request information about allocation to aid optimization?

On a higher level, which of these things are determined by my hardware and which by the operating system? I am in the process of buying a high-end multi-socket machine and am deciding between AMD Opteron and Intel Xeon. What are the implications of the specific hardware on any of the questions above?

asked Oct 16 '11 by gertjan



1 Answer

Since the GIL is one of Python's Achilles' heels, it has fairly good multiprocessing support instead. The multiprocessing module provides Queues, Pipes, Locks, shared values, and shared arrays. There is also something called a Manager, which lets you wrap a lot of Python data structures and share them in an IPC-friendly way. I imagine that most of these work via pipes or sockets, but I haven't delved too deeply into the internals.

http://docs.python.org/2/library/multiprocessing.html
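
As a rough illustration (my own sketch, not from the docs above): a shared Array that forked workers read and write without copying the data, plus a Queue for sending results back. The slicing scheme and the worker function are made up for the example.

```python
from multiprocessing import Process, Queue, Array

def worker(shared, start, stop, results):
    # Work on one slice of the shared array; only the handle to the
    # shared memory block is passed around, not the data itself.
    total = 0.0
    for i in range(start, stop):
        shared[i] *= 2.0
        total += shared[i]
    results.put((start, stop, total))

if __name__ == '__main__':
    data = Array('d', range(1000))       # 1000 doubles in shared memory
    results = Queue()
    chunk = len(data) // 4
    procs = [Process(target=worker, args=(data, i * chunk, (i + 1) * chunk, results))
             for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    while not results.empty():
        print(results.get())
```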

How does Linux model NUMA systems?

The kernel detects that it's running on a multi-core machine, then detects how much hardware there is and what the topology is. It then creates a model of this topology using the idea of Nodes. A Node is a physical socket that contains a CPU (possibly with multiple cores) and the memory attached to it. Why Node-based instead of core-based? Because a memory bus is the set of physical wires that connect RAM to a CPU socket, and all cores on a CPU in a single socket have the same access time to all the RAM that resides on that memory bus.
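
You can see the kernel's Node model directly, either with `numactl --hardware` or in sysfs. A small sketch (Linux-specific paths, not part of the original answer):

```python
import glob, os

# List each NUMA node the kernel has modelled, with its CPUs and local RAM.
# These sysfs paths exist on NUMA-aware Linux kernels.
for node in sorted(glob.glob('/sys/devices/system/node/node[0-9]*')):
    cpus = open(os.path.join(node, 'cpulist')).read().strip()
    with open(os.path.join(node, 'meminfo')) as f:
        mem_kb = next(line for line in f if 'MemTotal' in line).split()[-2]
    print('%s: cpus %s, %s kB local RAM' % (os.path.basename(node), cpus, mem_kb))
```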

How is memory on one memory bus accessed by a core on another memory bus?

On x86 systems this happens through the caches. The CPU uses a piece of hardware called the Translation Lookaside Buffer (TLB) to map virtual addresses to physical addresses. If the memory a cache miss needs is local, it's read locally. If it's not local, the request goes over the HyperTransport links on AMD systems, or QuickPath Interconnect on Intel systems, to the remote node. Since it's done at the cache level you theoretically don't need to know about it, and you certainly don't have any direct control over it. But for high-performance applications it's incredibly useful to understand, so you can minimize the number of remote accesses.
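
One practical way to keep remote accesses down (my own sketch, not part of the answer): pin each worker to the cores of one node, so the pages it touches first are allocated from that node's local memory under Linux's default local-allocation policy and later accesses stay on the local bus. The core-to-node mapping below is an assumption; read the real one from sysfs as shown above.

```python
import os
from multiprocessing import Process

NODE0_CPUS = {0, 1, 2, 3}   # assumed: cores 0-3 sit on node 0; verify on your machine

def worker():
    # Restrict this process to node 0's cores (Python 3.3+, Linux only).
    os.sched_setaffinity(0, NODE0_CPUS)
    # First touch happens here, so these pages land on node 0
    # under the default local-allocation policy.
    data = bytearray(100 * 1024 * 1024)
    # ... memory-bound work on `data` ...

if __name__ == '__main__':
    p = Process(target=worker)
    p.start()
    p.join()
```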

Where does the OS actually locate the physical pages of virtual memory?

When a process is forked it inherits all of its parent's pages (due to copy-on-write). The kernel also has an idea of which node is "best" for the process, its "preferred" node. This can be modified, but again defaults to the same as the parent's. Memory allocation will default to the same node as the parent unless it's explicitly changed.
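
If you want to check where a process's pages actually ended up, NUMA-enabled Linux kernels expose per-mapping, per-node page counts in /proc/<pid>/numa_maps. A small sketch that sums them per node:

```python
def node_page_counts(pid='self'):
    """Return {node: resident pages} parsed from /proc/<pid>/numa_maps."""
    counts = {}
    with open('/proc/%s/numa_maps' % pid) as f:
        for line in f:
            for field in line.split():
                # Per-node counts look like "N0=1024", "N1=12", ...
                node, sep, pages = field.partition('=')
                if sep and node.startswith('N') and node[1:].isdigit():
                    n = int(node[1:])
                    counts[n] = counts.get(n, 0) + int(pages)
    return counts

if __name__ == '__main__':
    print(node_page_counts())   # e.g. {0: 3205, 1: 12}
```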

Is there a transparent process that moves memory?

No. Once memory is allocated it's fixed to the node it was allocated on. You can make a new allocation on another node, move the data, and deallocate on the first node, but it's a bit of a chore.

Is there a way to have control over allocation?

The default is to allocate to the local node. If you use libnuma you can change how the allocation is done (say round-robin or interleaved) instead of defaulting to local.
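
From Python the usual routes to libnuma are the numactl command-line tool (e.g. `numactl --interleave=all python script.py`) or calling the C library through ctypes. A rough ctypes sketch, assuming libnuma is installed; the function names come from the numa(3) C API and error handling is omitted:

```python
import ctypes, ctypes.util

libnuma = ctypes.CDLL(ctypes.util.find_library('numa'))
if libnuma.numa_available() < 0:
    raise RuntimeError('NUMA policy API not available on this system')

libnuma.numa_set_preferred(1)          # prefer node 1 for future allocations
# libnuma.numa_set_localalloc()        # or: back to the default local policy

# Explicitly place a 64 MiB buffer on node 0.
libnuma.numa_alloc_onnode.restype = ctypes.c_void_p
nbytes = 64 * 1024 * 1024
buf = libnuma.numa_alloc_onnode(ctypes.c_size_t(nbytes), 0)
# ... wrap `buf` (e.g. (ctypes.c_char * nbytes).from_address(buf)) and use it ...
libnuma.numa_free(ctypes.c_void_p(buf), ctypes.c_size_t(nbytes))
```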

I have taken a lot of information from this blog post:

http://blog.jcole.us/2010/09/28/mysql-swap-insanity-and-the-numa-architecture/

I would definitely recommend that you read it in its entirety to glean additional information.

answered Oct 28 '22 by Mike Sandford