I am experimenting with NUMA on a machine that has four Opteron 6272 processors, running CentOS. There are 8 NUMA nodes, each with 16GB of memory.
Here is a small test program I'm running.
// build with e.g.: g++ test.cpp -lnuma -pthread
#include <numa.h>
#include <pthread.h>
#include <sched.h>
#include <iostream>

void pin_to_core( size_t core )
{
    cpu_set_t cpuset;
    CPU_ZERO( &cpuset );
    CPU_SET( core, &cpuset );
    pthread_setaffinity_np( pthread_self(), sizeof(cpu_set_t), &cpuset );
}

int main()
{
    pin_to_core( 0 );

    size_t bufSize = 100;

    for( int i = 0; i < 131000; ++i )
    {
        if( !(i % 10) )
        {
            std::cout << i << std::endl;
            long long free = 0;
            for( unsigned j = 0; j < 8; ++j )
            {
                numa_node_size64( j, &free );
                std::cout << "Free on node " << j << ": " << free << std::endl;
            }
        }

        char* buf = (char*)numa_alloc_onnode( bufSize, 5 );
        for( unsigned j = 0; j < bufSize; ++j )
            buf[j] = j;
    }

    return 0;
}
So basically a thread running on core #0 allocates 131K 100-byte buffers on NUMA node 5, initializes them with junk and leaks them. Once every 10 iterations we print out information about how much memory is available on each NUMA node.
In the beginning of the output I get:
0
Free on node 0: 16115879936
Free on node 1: 16667398144
Free on node 2: 16730402816
Free on node 3: 16529108992
Free on node 4: 16624508928
Free on node 5: 16361529344
Free on node 6: 16747118592
Free on node 7: 16631336960
...
And at the end I'm getting:
Free on node 0: 15826657280
Free on node 1: 16667123712
Free on node 2: 16731033600
Free on node 3: 16529358848
Free on node 4: 16624885760
Free on node 5: 16093630464
Free on node 6: 16747384832
Free on node 7: 16631332864
130970
Free on node 0: 15826657280
Free on node 1: 16667123712
Free on node 2: 16731033600
Free on node 3: 16529358848
Free on node 4: 16624885760
Free on node 5: 16093630464
Free on node 6: 16747384832
Free on node 7: 16631332864
mbind: Cannot allocate memory
mbind: Cannot allocate memory
mbind: Cannot allocate memory
mbind: Cannot allocate memory
mbind: Cannot allocate memory
mbind: Cannot allocate memory
mbind: Cannot allocate memory
130980
...
Things that are not clear to me:
1) Why are there those "mbind: Cannot allocate memory" messages? I am far from using up all of the memory, and the behaviour doesn't change if I increase the buffer size to, say, 1000, which leads me to think that I'm running out of some kind of kernel resource handle.
2) Even though I asked for the memory to be allocated on node 5, the actual allocations seem to have been split between nodes 0 and 5.
Can anyone please give any insights into why this is happening?
UPDATE
I'd like to give more detail on point (2). The fact that some of the memory isn't allocated on node 5 seems to be connected to the fact that we are initializing the buffer on core #0 (which belongs to NUMA node 0). If I change pin_to_core(0) to pin_to_core(8), then the allocated memory is split between nodes 1 and 5. If it is pin_to_core(40), then all of the memory is allocated on node 5.
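For reference, the core-to-node mapping can be checked programmatically with libnuma; a minimal sketch (I assume its output agrees with what numactl --hardware reports):

#include <numa.h>
#include <iostream>

int main()
{
    if( numa_available() < 0 )
        return 1;
    // Print which NUMA node each logical CPU belongs to, so the argument
    // to pin_to_core() can be chosen deliberately.
    for( int cpu = 0; cpu < numa_num_configured_cpus(); ++cpu )
        std::cout << "cpu " << cpu << " -> node " << numa_node_of_cpu( cpu ) << std::endl;
    return 0;
}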
UPDATE2
I've looked at the source code of libnuma and tried replacing the call to numa_alloc_onnode() with the lower-level calls it is built on: mmap() and mbind(). I am now also checking which NUMA node the memory actually resides on, using the move_pages() call. The results are as follows. Before initialization (the loop over j) a page is not mapped to any node (I get the ENOENT error code), and after initialization it is assigned either to node 0 or to node 5. The pattern is regular: 5, 0, 5, 0, ... As before, when we get close to the 131000-th iteration the calls to mbind() start returning error codes, and when this happens the page always ends up on node 0. The error code returned by mbind() is ENOMEM; the documentation says this means running out of "kernel memory". I don't know what that is, but it can't be "physical" memory, because I have 16GB per node.
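For reference, this is roughly what that lower-level replacement looks like. It is only a sketch rather than my exact code; the helper names alloc_on_node() and query_node() are made up, and error handling is minimal:

#include <numaif.h>     // mbind(), move_pages(), MPOL_BIND
#include <sys/mman.h>   // mmap()
#include <cstddef>
#include <cstdio>

void* alloc_on_node( size_t size, int node )
{
    void* p = mmap( NULL, size, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0 );
    if( p == MAP_FAILED )
        return NULL;

    unsigned long nodemask = 1UL << node;   // assumes node < 64
    if( mbind( p, size, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0 ) != 0 )
        std::perror( "mbind" );             // this is where ENOMEM shows up
    return p;
}

int query_node( void* page )
{
    int status = -1;
    // With nodes == NULL, move_pages() only reports the node of each page.
    if( move_pages( 0, 1, &page, NULL, &status, 0 ) != 0 )
        return -1;
    return status;      // node number, or -ENOENT if the page is not yet faulted in
}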
So here are my conclusions so far:

1) The restrictions on memory placement imposed by mbind() are honoured only 50% of the time when a core of another NUMA node touches the memory first. I wish this were documented somewhere, because quietly breaking a promise is not nice...

2) There is a limit on the number of calls to mbind(), so one should mbind() big memory chunks whenever possible.

The approach that I'm going to try is: do memory allocation tasks on threads that are pinned to cores of particular NUMA nodes. For extra peace of mind I will also try calling mlock() (because of the issues described here).
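Roughly what I have in mind (an untested sketch; the helper names are made up, and I use numa_run_on_node() as a shortcut instead of pinning to an explicit core):

#include <numa.h>
#include <pthread.h>
#include <sys/mman.h>   // mlock()
#include <cstring>

struct AllocRequest { size_t size; int node; void* result; };

static void* alloc_worker( void* arg )
{
    AllocRequest* req = static_cast<AllocRequest*>( arg );

    // Run the allocating thread on the CPUs of the target node,
    // so that first-touch happens on the right node.
    numa_run_on_node( req->node );

    req->result = numa_alloc_onnode( req->size, req->node );
    if( req->result )
    {
        std::memset( req->result, 0, req->size );    // first touch
        mlock( req->result, req->size );             // extra peace of mind
    }
    return NULL;
}

void* alloc_on_node_via_thread( size_t size, int node )
{
    AllocRequest req = { size, node, NULL };
    pthread_t t;
    pthread_create( &t, NULL, alloc_worker, &req );
    pthread_join( t, NULL );
    return req.result;
}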
As you have already discovered from reading libnuma.c, each call to numa_alloc_onnode() creates a new anonymous memory map and then binds the memory region to the specified NUMA node. With so many invocations of mmap() you are simply hitting the maximum number of memory mappings allowed per process. The value can be read from /proc/sys/vm/max_map_count and can be modified by the system administrator, either by writing to the pseudofile:
# echo 1048576 > /proc/sys/vm/max_map_count
or with sysctl:
# sysctl -w vm.max_map_count=1048576
The default on many Linux distributions is 65530 mappings. mmap() implements mapping coalescing, i.e. it first tries to extend an existing mapping before creating a new one. In my tests it creates a new mapping on every second invocation and otherwise extends the previous one. Before the first call to numa_alloc_onnode() my test processes have 37 mappings, therefore mmap() should start failing somewhere after 2 * (65530 - 37) = 130986 calls.
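A quick way to watch this happen is to compare the number of lines in /proc/self/maps (one line per mapping) against the limit; a minimal sketch:

#include <fstream>
#include <iostream>
#include <string>

int main()
{
    long maxMaps = 0;
    std::ifstream limit( "/proc/sys/vm/max_map_count" );
    limit >> maxMaps;

    long curMaps = 0;
    std::ifstream maps( "/proc/self/maps" );
    std::string line;
    while( std::getline( maps, line ) )
        ++curMaps;                      // one line per memory mapping

    std::cout << curMaps << " mappings used, limit is " << maxMaps << std::endl;
    return 0;
}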
It looks like when mbind() is applied to part of an existing mapping, something strange happens and the newly affected region is not bound properly. I would have to dig into the kernel source code in order to find out why. On the other hand, if you replace:
numa_alloc_onnode( bufSize, 5 )
with
numa_alloc_onnode( bufSize, i % 4 )
then no mapping coalescing is performed, mmap() fails around the 65500-th iteration, and all allocations are properly bound.
For your first question, from the man page of numa_alloc_onnode():
The size argument will be rounded up to a multiple of the system page size.
That means that although you are requesting only a small amount of data, you are getting whole pages. In your program you are effectively requesting 131000 system pages; with 4 KiB pages that is roughly 512 MiB, which is consistent with the drop in free memory you observe across nodes 0 and 5.
For your second question, I suggest using numa_set_strict() to force numa_alloc_onnode() to fail if it cannot allocate a page on the given node.
numa_set_strict() sets a flag that says whether the functions allocating
on specific nodes should use a strict policy. Strict means the
allocation will fail if the memory cannot be allocated on the target
node. Default operation is to fall back to other nodes. This doesn't
apply to interleave and default.
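A minimal sketch of that; going by the excerpt above, with the strict flag set the allocation should fail instead of falling back, which I would expect to surface as a NULL return from numa_alloc_onnode():

#include <numa.h>
#include <iostream>

int main()
{
    if( numa_available() < 0 )
        return 1;

    numa_set_strict( 1 );   // fail instead of silently falling back to other nodes

    void* buf = numa_alloc_onnode( 100, 5 );
    if( !buf )
    {
        std::cerr << "strict allocation on node 5 failed" << std::endl;
        return 1;
    }
    // ... use buf ...
    numa_free( buf, 100 );
    return 0;
}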