I've tried to measure the asymmetric memory access effects of NUMA, and failed.
The test was performed on a machine with two Intel Xeon X5570 CPUs @ 2.93 GHz (8 cores total).
On a thread pinned to core 0, I allocate an array x of 10,000,000 bytes on core 0's NUMA node with numa_alloc_local. Then I iterate over array x 50 times, reading and writing each byte, and measure the elapsed time for the 50 iterations.
Then, on each of the other cores in my server, I pin a new thread and again measure the elapsed time to do 50 iterations of reading and writing to every byte in array x.
Array x is large to minimize cache effects. We want to measure the speed when the CPU has to go all the way to RAM to load and store, not when caches are helping.
There are two NUMA nodes in my server, so I would expect the cores on the same node where array x is allocated to have faster read/write speeds than the cores on the other node. I'm not seeing that.
Why?
Perhaps NUMA is only relevant on systems with > 8-12 cores, as I've seen suggested elsewhere?
http://lse.sourceforge.net/numa/faq/
#include <numa.h>
#include <iostream>
#include <boost/thread/thread.hpp>
#include <boost/date_time/posix_time/posix_time.hpp>
#include <pthread.h>

// Pin the calling thread to the given core.
void pin_to_core(size_t core)
{
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core, &cpuset);
    pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
}

// Print a libnuma bitmask as a string of 0s and 1s.
std::ostream& operator<<(std::ostream& os, const bitmask& bm)
{
    for (size_t i = 0; i < bm.size; ++i)
    {
        os << numa_bitmask_isbitset(&bm, i);
    }
    return os;
}

// Allocate N bytes on the local NUMA node of `core`, then time M
// read/write passes over the buffer. The buffer is returned via *x.
void thread1(void** x, size_t core, size_t N, size_t M)
{
    pin_to_core(core);

    void* y = numa_alloc_local(N);

    boost::posix_time::ptime t1 = boost::posix_time::microsec_clock::universal_time();

    char c;
    for (size_t i(0); i < M; ++i)
        for (size_t j(0); j < N; ++j)
        {
            c = ((char*)y)[j];
            ((char*)y)[j] = c;
        }

    boost::posix_time::ptime t2 = boost::posix_time::microsec_clock::universal_time();

    std::cout << "Elapsed read/write by same thread that allocated on core " << core
              << ": " << (t2 - t1) << std::endl;

    *x = y;
}

// Time M read/write passes over the already-allocated buffer x from `core`.
void thread2(void* x, size_t core, size_t N, size_t M)
{
    pin_to_core(core);

    boost::posix_time::ptime t1 = boost::posix_time::microsec_clock::universal_time();

    char c;
    for (size_t i(0); i < M; ++i)
        for (size_t j(0); j < N; ++j)
        {
            c = ((char*)x)[j];
            ((char*)x)[j] = c;
        }

    boost::posix_time::ptime t2 = boost::posix_time::microsec_clock::universal_time();

    std::cout << "Elapsed read/write by thread on core " << core
              << ": " << (t2 - t1) << std::endl;
}

int main(int argc, const char **argv)
{
    int numcpus = numa_num_task_cpus();
    std::cout << "numa_available() " << numa_available() << std::endl;
    numa_set_localalloc();

    // Print which cores belong to which NUMA node, and each node's size.
    bitmask* bm = numa_bitmask_alloc(numcpus);
    for (int i = 0; i <= numa_max_node(); ++i)
    {
        numa_node_to_cpus(i, bm);
        std::cout << "numa node " << i << " " << *bm << " "
                  << numa_node_size(i, 0) << std::endl;
    }
    numa_bitmask_free(bm);

    void* x;
    size_t N(10000000);
    size_t M(50);

    // Allocate and time on core 0, then time the same buffer from every core.
    boost::thread t1(boost::bind(&thread1, &x, 0, N, M));
    t1.join();

    for (int i = 0; i < numcpus; ++i)
    {
        boost::thread t2(boost::bind(&thread2, x, i, N, M));
        t2.join();
    }

    numa_free(x, N);

    return 0;
}
g++ -o numatest -pthread -lboost_thread -lnuma -O0 numatest.cpp
./numatest

numa_available() 0                    <-- NUMA is available on this system
numa node 0 10101010 12884901888      <-- cores 0,2,4,6 are on NUMA node 0, which is about 12 GB
numa node 1 01010101 12874584064      <-- cores 1,3,5,7 are on NUMA node 1, which is slightly smaller than node 0

Elapsed read/write by same thread that allocated on core 0: 00:00:01.767428
Elapsed read/write by thread on core 0: 00:00:01.760554
Elapsed read/write by thread on core 1: 00:00:01.719686
Elapsed read/write by thread on core 2: 00:00:01.708830
Elapsed read/write by thread on core 3: 00:00:01.691560
Elapsed read/write by thread on core 4: 00:00:01.686912
Elapsed read/write by thread on core 5: 00:00:01.691917
Elapsed read/write by thread on core 6: 00:00:01.686509
Elapsed read/write by thread on core 7: 00:00:01.689928
Doing 50 iterations reading and writing over array x takes about 1.7 seconds, no matter which core is doing the reading and writing.
The cache size on my CPUs is 8 MB, so maybe the 10 MB array x is not big enough to eliminate cache effects. I tried a 100 MB array x, and I've tried issuing a full memory fence with __sync_synchronize() inside my innermost loops. It still doesn't reveal any asymmetry between NUMA nodes.
I've tried reading and writing to array x with __sync_fetch_and_add(). Still nothing.
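For reference, those two variants looked roughly like this (a sketch, assuming the same x, N and M as in the code above, not necessarily the exact code I ran):

// Variant 1: full memory fence inside the innermost loop
for (size_t i = 0; i < M; ++i)
    for (size_t j = 0; j < N; ++j)
    {
        char c = ((char*)x)[j];
        ((char*)x)[j] = c;
        __sync_synchronize();   // full barrier after each read/write pair
    }

// Variant 2: atomic read-modify-write of each byte
for (size_t i = 0; i < M; ++i)
    for (size_t j = 0; j < N; ++j)
        __sync_fetch_and_add((char*)x + j, 1);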
NUMA gives processors fast access to their local memory, as opposed to uniform shared-memory architectures where access times can be longer, slowing down execution of key processor and system tasks.
NUMA (non-uniform memory access) is a way of configuring a cluster of microprocessors in a multiprocessing system so that they can share memory locally, improving performance and the system's ability to be expanded. NUMA is used in symmetric multiprocessing (SMP) systems.
The current Microsoft guidance is: “In most cases you can determine your NUMA node boundaries by dividing the amount of physical RAM by the number of logical processors (cores).”
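Applied to the machine in the question, for example: roughly 24 GB of physical RAM across 8 cores works out to about 3 GB per core, or about 12 GB per 4-core node, which matches the numa_node_size figures in the output above.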
A NUMA-aware architecture is a hardware design that separates its cores into multiple clusters, where each cluster has its own local memory region while still allowing cores in one cluster to access all memory in the system.
The first thing I want to point out is that you might want to double-check which cores are on each node. I don't recall cores and nodes being interleaved like that. Also, you should have 16 hardware threads due to Hyper-Threading (unless you disabled it).
Another thing:
The socket 1366 Xeon machines are only slightly NUMA. So it will be hard to see the difference. The NUMA effect is much more noticeable on the 4P Opterons.
On systems like yours, the node-to-node bandwidth is actually faster than the CPU-to-memory bandwidth. Since your access pattern is completely sequential, you are getting the full bandwidth regardless of whether the data is local. A better thing to measure is the latency. Try randomly accessing a block of 1 GB instead of streaming it sequentially.
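A minimal sketch of what such a latency-oriented loop could look like (an illustration, assuming a buffer x of N bytes; the `accesses` count and the index-update constants are arbitrary choices, not from the original post):

// Dependent pseudo-random walk over an N-byte buffer x.
// Each index depends on the byte just loaded, so the loads cannot
// overlap and the hardware prefetcher cannot predict them.
size_t idx = 0;
unsigned char sum = 0;
for (size_t k = 0; k < accesses; ++k)
{
    unsigned char v = ((unsigned char*)x)[idx];
    sum += v;                          // keep the load from being optimized away
    idx = (idx * 1009 + v + 1) % N;    // next index depends on the loaded value
}
std::cout << (int)sum << std::endl;    // use the result so the loop survives optimization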
Last thing:
Depending on how aggressively your compiler optimizes, your loop might be optimized out since it doesn't do anything:
c = ((char*)x)[j]; ((char*)x)[j] = c;
Something like this will guarantee that it won't be eliminated by the compiler:
((char*)x)[j] += 1;
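Another way to keep the accesses from being eliminated (an alternative, not something suggested in the answer) is to go through a volatile pointer, since accesses through a volatile lvalue must actually be performed:

// Sketch: same read-then-write pattern as the original loop body,
// but the volatile qualifier forbids the compiler from dropping it.
volatile char* p = (volatile char*)x;
for (size_t j = 0; j < N; ++j)
{
    char c = p[j];
    p[j] = c;
}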
Ah hah! Mysticial is right! Somehow, hardware pre-fetching is optimizing my read/writes.
If it were a cache optimization, then forcing a memory barrier would defeat the optimization:
c = __sync_fetch_and_add(((char*)x) + j, 1);
but that doesn't make any difference. What does make a difference is multiplying my iterator index by prime 1009 to defeat the pre-fetching optimization:
*(((char*)x) + ((j * 1009) % N)) += 1;
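For reference, the modified timing loop looks roughly like this (same x, N, M as in the code above):

for (size_t i = 0; i < M; ++i)
    for (size_t j = 0; j < N; ++j)
        // stride through the array by a prime so consecutive accesses
        // land far apart and the hardware prefetcher cannot keep up
        *(((char*)x) + ((j * 1009) % N)) += 1;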
With that change, the NUMA asymmetry is clearly revealed:
numa_available() 0
numa node 0 10101010 12884901888
numa node 1 01010101 12874584064

Elapsed read/write by same thread that allocated on core 0: 00:00:00.961725
Elapsed read/write by thread on core 0: 00:00:00.942300
Elapsed read/write by thread on core 1: 00:00:01.216286
Elapsed read/write by thread on core 2: 00:00:00.909353
Elapsed read/write by thread on core 3: 00:00:01.218935
Elapsed read/write by thread on core 4: 00:00:00.898107
Elapsed read/write by thread on core 5: 00:00:01.211413
Elapsed read/write by thread on core 6: 00:00:00.898021
Elapsed read/write by thread on core 7: 00:00:01.207114
At least I think that's what's going on.
Thanks Mysticial!
EDIT: CONCLUSION (remote access ≈ 133% of local)
For anyone who is just glancing at this post to get a rough idea of the performance characteristics of NUMA, here is the bottom line according to my tests:
Memory access to a non-local NUMA node has about 1.33 times the latency of memory access to a local node.