I've tried to measure the asymmetric memory access effects of NUMA, and failed.
The test was performed on a machine with two Intel Xeon X5570 CPUs @ 2.93 GHz (8 cores total).
On a thread pinned to core 0, I allocate an array x of 10,000,000 bytes on core 0's NUMA node with numa_alloc_local. Then I iterate over array x 50 times, reading and writing each byte, and measure the elapsed time for the 50 iterations.
Then, on each of the other cores in my server, I pin a new thread and again measure the elapsed time to do 50 iterations of reading and writing to every byte in array x.
Array x is large to minimize cache effects. We want to measure the speed when the CPU has to go all the way to RAM to load and store, not when caches are helping.
There are two NUMA nodes in my server, so I would expect the cores on the same node where array x is allocated to have faster read/write speeds than the cores on the other node. I'm not seeing that.
Why?
Perhaps NUMA is only relevant on systems with > 8-12 cores, as I've seen suggested elsewhere?
http://lse.sourceforge.net/numa/faq/
#include <numa.h>
#include <iostream>
#include <boost/thread/thread.hpp>
#include <boost/date_time/posix_time/posix_time.hpp>
#include <pthread.h>

// Pin the calling thread to the given core.
void pin_to_core(size_t core)
{
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core, &cpuset);
    pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
}

// Print a libnuma bitmask as a string of 0s and 1s.
std::ostream& operator<<(std::ostream& os, const bitmask& bm)
{
    for (size_t i = 0; i < bm.size; ++i)
    {
        os << numa_bitmask_isbitset(&bm, i);
    }
    return os;
}

// Allocate N bytes on the local NUMA node of `core`, then time M
// read/write passes over the buffer. The buffer is returned via *x.
void thread1(void** x, size_t core, size_t N, size_t M)
{
    pin_to_core(core);

    void* y = numa_alloc_local(N);

    boost::posix_time::ptime t1 = boost::posix_time::microsec_clock::universal_time();

    char c;
    for (size_t i(0); i < M; ++i)
        for (size_t j(0); j < N; ++j)
        {
            c = ((char*)y)[j];
            ((char*)y)[j] = c;
        }

    boost::posix_time::ptime t2 = boost::posix_time::microsec_clock::universal_time();

    std::cout << "Elapsed read/write by same thread that allocated on core " << core
              << ": " << (t2 - t1) << std::endl;

    *x = y;
}

// Time M read/write passes over the already-allocated buffer x from `core`.
void thread2(void* x, size_t core, size_t N, size_t M)
{
    pin_to_core(core);

    boost::posix_time::ptime t1 = boost::posix_time::microsec_clock::universal_time();

    char c;
    for (size_t i(0); i < M; ++i)
        for (size_t j(0); j < N; ++j)
        {
            c = ((char*)x)[j];
            ((char*)x)[j] = c;
        }

    boost::posix_time::ptime t2 = boost::posix_time::microsec_clock::universal_time();

    std::cout << "Elapsed read/write by thread on core " << core
              << ": " << (t2 - t1) << std::endl;
}

int main(int argc, const char **argv)
{
    int numcpus = numa_num_task_cpus();
    std::cout << "numa_available() " << numa_available() << std::endl;
    numa_set_localalloc();

    // Print which cores belong to which NUMA node, and each node's size.
    bitmask* bm = numa_bitmask_alloc(numcpus);
    for (int i = 0; i <= numa_max_node(); ++i)
    {
        numa_node_to_cpus(i, bm);
        std::cout << "numa node " << i << " " << *bm << " "
                  << numa_node_size(i, 0) << std::endl;
    }
    numa_bitmask_free(bm);

    void* x;
    size_t N(10000000);
    size_t M(50);

    // Allocate and time on core 0, then time the same buffer from every core.
    boost::thread t1(boost::bind(&thread1, &x, 0, N, M));
    t1.join();

    for (int i = 0; i < numcpus; ++i)
    {
        boost::thread t2(boost::bind(&thread2, x, i, N, M));
        t2.join();
    }

    numa_free(x, N);

    return 0;
}
g++ -o numatest -pthread -lboost_thread -lnuma -O0 numatest.cpp
./numatest

numa_available() 0                    <-- NUMA is available on this system
numa node 0 10101010 12884901888      <-- cores 0,2,4,6 are on NUMA node 0, which is about 12 GB
numa node 1 01010101 12874584064      <-- cores 1,3,5,7 are on NUMA node 1, which is slightly smaller than node 0

Elapsed read/write by same thread that allocated on core 0: 00:00:01.767428
Elapsed read/write by thread on core 0: 00:00:01.760554
Elapsed read/write by thread on core 1: 00:00:01.719686
Elapsed read/write by thread on core 2: 00:00:01.708830
Elapsed read/write by thread on core 3: 00:00:01.691560
Elapsed read/write by thread on core 4: 00:00:01.686912
Elapsed read/write by thread on core 5: 00:00:01.691917
Elapsed read/write by thread on core 6: 00:00:01.686509
Elapsed read/write by thread on core 7: 00:00:01.689928
Doing 50 iterations reading and writing over array x takes about 1.7 seconds, no matter which core is doing the reading and writing.
The cache size on my CPUs is 8 MB, so maybe the 10 MB array x is not big enough to eliminate cache effects. I tried a 100 MB array x, and I've tried issuing a full memory fence with __sync_synchronize() inside my innermost loops. It still doesn't reveal any asymmetry between NUMA nodes.
I've tried reading and writing to array x with __sync_fetch_and_add(). Still nothing.
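For reference, those two variants looked roughly like this (a sketch, assuming the same x, N and M as in the code above, not necessarily the exact code I ran):

// Variant 1: full memory fence inside the innermost loop
for (size_t i = 0; i < M; ++i)
    for (size_t j = 0; j < N; ++j)
    {
        char c = ((char*)x)[j];
        ((char*)x)[j] = c;
        __sync_synchronize();   // full barrier after each read/write pair
    }

// Variant 2: atomic read-modify-write of each byte
for (size_t i = 0; i < M; ++i)
    for (size_t j = 0; j < N; ++j)
        __sync_fetch_and_add((char*)x + j, 1);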
NUMA gives processors fast access to their local memory, as opposed to uniform shared-memory architectures where access times can be longer, slowing down execution of key processor and system tasks.
NUMA (non-uniform memory access) is a way of configuring a cluster of microprocessors in a multiprocessing system so that they can share memory locally, improving performance and the system's ability to be expanded. NUMA is used in symmetric multiprocessing (SMP) systems.
The current Microsoft guidance is: “In most cases you can determine your NUMA node boundaries by dividing the amount of physical RAM by the number of logical processors (cores).”
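Applied to the machine in the question, for example: roughly 24 GB of physical RAM across 8 cores works out to about 3 GB per core, or about 12 GB per 4-core node, which matches the numa_node_size figures in the output above.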
A NUMA-aware architecture is a hardware design that separates its cores into multiple clusters, where each cluster has its own local memory region while still allowing cores in one cluster to access all memory in the system.
The first thing I want to point out is that you might want to double-check which cores are on each node. I don't recall cores and nodes being interleaved like that. Also, you should have 16 hardware threads due to Hyper-Threading (unless you disabled it).
Another thing:
The socket 1366 Xeon machines are only slightly NUMA. So it will be hard to see the difference. The NUMA effect is much more noticeable on the 4P Opterons.
On systems like yours, the node-to-node bandwidth is actually faster than the CPU-to-memory bandwidth. Since your access pattern is completely sequential, you are getting the full bandwidth regardless of whether the data is local. A better thing to measure is the latency. Try randomly accessing a block of 1 GB instead of streaming it sequentially.
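A minimal sketch of what such a latency-oriented loop could look like (an illustration, assuming a buffer x of N bytes; the `accesses` count and the index-update constants are arbitrary choices, not from the original post):

// Dependent pseudo-random walk over an N-byte buffer x.
// Each index depends on the byte just loaded, so the loads cannot
// overlap and the hardware prefetcher cannot predict them.
size_t idx = 0;
unsigned char sum = 0;
for (size_t k = 0; k < accesses; ++k)
{
    unsigned char v = ((unsigned char*)x)[idx];
    sum += v;                          // keep the load from being optimized away
    idx = (idx * 1009 + v + 1) % N;    // next index depends on the loaded value
}
std::cout << (int)sum << std::endl;    // use the result so the loop survives optimization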
Last thing:
Depending on how aggressively your compiler optimizes, your loop might be optimized out since it doesn't do anything:
c = ((char*)x)[j]; ((char*)x)[j] = c;
Something like this will guarantee that it won't be eliminated by the compiler:
((char*)x)[j] += 1;
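Another way to keep the accesses from being eliminated (an alternative, not something suggested in the answer) is to go through a volatile pointer, since accesses through a volatile lvalue must actually be performed:

// Sketch: same read-then-write pattern as the original loop body,
// but the volatile qualifier forbids the compiler from dropping it.
volatile char* p = (volatile char*)x;
for (size_t j = 0; j < N; ++j)
{
    char c = p[j];
    p[j] = c;
}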
Ah hah! Mysticial is right! Somehow, hardware pre-fetching is optimizing my read/writes.
If it were a cache optimization, then forcing a memory barrier would defeat the optimization:
c = __sync_fetch_and_add(((char*)x) + j, 1);
but that doesn't make any difference. What does make a difference is multiplying my iterator index by prime 1009 to defeat the pre-fetching optimization:
*(((char*)x) + ((j * 1009) % N)) += 1;
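For reference, the modified timing loop looks roughly like this (same x, N, M as in the code above):

for (size_t i = 0; i < M; ++i)
    for (size_t j = 0; j < N; ++j)
        // stride through the array by a prime so consecutive accesses
        // land far apart and the hardware prefetcher cannot keep up
        *(((char*)x) + ((j * 1009) % N)) += 1;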
With that change, the NUMA asymmetry is clearly revealed:
numa_available() 0
numa node 0 10101010 12884901888
numa node 1 01010101 12874584064

Elapsed read/write by same thread that allocated on core 0: 00:00:00.961725
Elapsed read/write by thread on core 0: 00:00:00.942300
Elapsed read/write by thread on core 1: 00:00:01.216286
Elapsed read/write by thread on core 2: 00:00:00.909353
Elapsed read/write by thread on core 3: 00:00:01.218935
Elapsed read/write by thread on core 4: 00:00:00.898107
Elapsed read/write by thread on core 5: 00:00:01.211413
Elapsed read/write by thread on core 6: 00:00:00.898021
Elapsed read/write by thread on core 7: 00:00:01.207114
At least I think that's what's going on.
Thanks Mysticial!
EDIT: CONCLUSION (remote access ≈ 133% of local)
For anyone who is just glancing at this post to get a rough idea of the performance characteristics of NUMA, here is the bottom line according to my tests:
Memory access to a non-local NUMA node has about 1.33 times the latency of memory access to a local node.