 

Mongodb in Docker: numactl --interleave=all explanation

I'm trying to create a Dockerfile for an in-memory MongoDB based on the official repo at https://hub.docker.com/_/mongo/.

In docker-entrypoint.sh I've encountered:

numa='numactl --interleave=all'
if $numa true &> /dev/null; then
    set -- $numa "$@"
fi

Basically it prepends numactl --interleave=all to the original docker command when numactl is available.
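
If I understand it correctly, the effect on the container command looks something like this (the mongod arguments below are just an illustration of whatever the container was started with):

# Hypothetical illustration of the argument rewriting done by the entrypoint
set -- mongod --wiredTigerCacheSizeGB 4   # what the container was started with ("$@")
numa='numactl --interleave=all'
if $numa true &> /dev/null; then          # probe: can numactl actually run on this host?
    set -- $numa "$@"                     # "$@" becomes: numactl --interleave=all mongod --wiredTigerCacheSizeGB 4
fi
echo "$@"                                 # in the real entrypoint this command is eventually exec'd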

But I don't really understand this NUMA policy thing. Can you please explain what NUMA really means, and what --interleave=all stands for?

And why do we need to use it to create MongoDB instance?

asked Oct 24 '15 by mkurnikov

1 Answer

The man page mentions:

The libnuma library offers a simple programming interface to the NUMA (Non Uniform Memory Access) policy supported by the Linux kernel. On a NUMA architecture some memory areas have different latency or bandwidth than others.

This isn't available on all architectures, which is why issue 14 made sure to call numactl only on NUMA machines.

As explained in "Set default numa policy to “interleave” system wide":

It seems that most applications that recommend explicit numactl definition either make a libnuma library call or incorporate numactl in a wrapper script.


The --interleave=all policy alleviates the kind of issue encountered by applications like Cassandra (a distributed database for managing large amounts of structured data across many commodity servers):

By default, Linux attempts to be smart about memory allocations such that data is close to the NUMA node on which it runs. For big database type of applications, this is not the best thing to do if the priority is to avoid disk I/O. In particular with Cassandra, we're heavily multi-threaded anyway and there is no particular reason to believe that one NUMA node is "better" than another.

Consequences of allocating unevenly among NUMA nodes can include excessive page cache eviction when the kernel tries to allocate memory - such as when restarting the JVM.

For more, see "The MySQL “swap insanity” problem and the effects of the NUMA architecture"
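
If you want to see whether this is happening on your own box, one rough check (assuming the numastat tool from the numactl package is installed and mongod is running) is to look at the per-node allocation of the process:

# Per-NUMA-node memory breakdown for the mongod process; a strongly skewed
# distribution across nodes is the symptom the article above describes.
# (pidof mongod is assumed to return a single PID here.)
numastat -p "$(pidof mongod)"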

Without NUMA

In a NUMA-based system, where the memory is divided into multiple nodes, how the system should handle this is not necessarily straightforward.
The default behavior of the system is to allocate memory in the same node as a thread is scheduled to run on, and this works well for small amounts of memory, but when you want to allocate more than half of the system memory it’s no longer physically possible to even do it in a single NUMA node: In a two-node system, only 50% of the memory is in each node.

http://jcole.us/blog/files/numa-imbalanced-allocation.png
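
To put made-up numbers on it: on a two-node host with 64 GB of RAM, each node owns 32 GB, so a mongod (or mysqld) configured to use, say, 48 GB cannot possibly be served from a single node; the local node fills up and the kernel starts reclaiming and swapping there even though the other node still has plenty of free memory.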

With NUMA:

An easy solution to this is to interleave the allocated memory. It is possible to do this using numactl as described above:

# numactl --interleave all command
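
Applied to MongoDB outside of Docker, that comes down to starting the daemon under the same wrapper (the dbpath below is only an example):

# Start mongod with its memory interleaved across all NUMA nodes
# (/data/db is just an illustrative dbpath)
numactl --interleave=all mongod --dbpath /data/db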

I mentioned in the comments that NUMA enumerates the hardware to understand the physical layout, and then divides the processors (not cores) into “nodes”.
With modern PC processors, this means one node per physical processor, regardless of the number of cores present.

That is a bit of an over-simplification, as Hristo Iliev points out:

AMD Opteron CPUs with a larger number of cores are actually 2-way NUMA systems on their own, with two HyperTransport (HT)-interconnected dies, each with its own memory controller, in a single physical package.
Also, Intel Haswell-EP CPUs with 10 or more cores come with two cache-coherent ring networks and two memory controllers and can be operated in a cluster-on-die mode, which presents itself as a two-way NUMA system.

It is wiser to say that a NUMA node is a set of cores that can reach some memory directly, without going through HT, QPI (QuickPath Interconnect), NUMAlink or some other interconnect.

http://jcole.us/blog/files/numa-balanced-allocation.png
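
If you are curious how your own machine is split into nodes (and therefore whether interleaving matters at all for it), both of these report the topology:

# NUMA nodes, the CPUs attached to each node, and per-node memory sizes
numactl --hardware

# The same information summarized by lscpu (look at the "NUMA node*" lines)
lscpu | grep -i numa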

answered Oct 18 '22 by VonC