Measuring NUMA (Non-Uniform Memory Access). No observable asymmetry. Why?

I've tried to measure the asymmetric memory access effects of NUMA, and failed.

The Experiment

Performed on an Intel Xeon X5570 @ 2.93GHz, 2 CPUs, 8 cores.

On a thread pinned to core 0, I allocate an array x of size 10,000,000 bytes on core 0's NUMA node with numa_alloc_local. Then I iterate over array x 50 times and read and write each byte in the array. Measure the elapsed time to do the 50 iterations.

Then, on each of the other cores in my server, I pin a new thread and again measure the elapsed time to do 50 iterations of reading and writing to every byte in array x.

Array x is large to minimize cache effects. We want to measure the speed when the CPU has to go all the way to RAM to load and store, not when caches are helping.

There are two NUMA nodes in my server, so I would expect the cores that have affinity on the same node in which array x is allocated to have faster read/write speed. I'm not seeing that.

Why?

Perhaps NUMA is only relevant on systems with > 8-12 cores, as I've seen suggested elsewhere?

http://lse.sourceforge.net/numa/faq/

numatest.cpp

#include <numa.h>
#include <iostream>
#include <boost/thread/thread.hpp>
#include <boost/date_time/posix_time/posix_time.hpp>
#include <pthread.h>

void pin_to_core(size_t core)
{
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core, &cpuset);
    pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
}

std::ostream& operator<<(std::ostream& os, const bitmask& bm)
{
    for (size_t i = 0; i < bm.size; ++i)
    {
        os << numa_bitmask_isbitset(&bm, i);
    }
    return os;
}

// Allocates N bytes on the local NUMA node, then times M read/write passes.
void thread1(void** x, size_t core, size_t N, size_t M)
{
    pin_to_core(core);

    void* y = numa_alloc_local(N);

    boost::posix_time::ptime t1 = boost::posix_time::microsec_clock::universal_time();

    char c;
    for (size_t i(0); i < M; ++i)
        for (size_t j(0); j < N; ++j)
        {
            c = ((char*)y)[j];
            ((char*)y)[j] = c;
        }

    boost::posix_time::ptime t2 = boost::posix_time::microsec_clock::universal_time();

    std::cout << "Elapsed read/write by same thread that allocated on core "
              << core << ": " << (t2 - t1) << std::endl;

    *x = y;
}

// Times M read/write passes over an already-allocated array from another core.
void thread2(void* x, size_t core, size_t N, size_t M)
{
    pin_to_core(core);

    boost::posix_time::ptime t1 = boost::posix_time::microsec_clock::universal_time();

    char c;
    for (size_t i(0); i < M; ++i)
        for (size_t j(0); j < N; ++j)
        {
            c = ((char*)x)[j];
            ((char*)x)[j] = c;
        }

    boost::posix_time::ptime t2 = boost::posix_time::microsec_clock::universal_time();

    std::cout << "Elapsed read/write by thread on core " << core << ": "
              << (t2 - t1) << std::endl;
}

int main(int argc, const char** argv)
{
    int numcpus = numa_num_task_cpus();
    std::cout << "numa_available() " << numa_available() << std::endl;
    numa_set_localalloc();

    bitmask* bm = numa_bitmask_alloc(numcpus);
    for (int i = 0; i <= numa_max_node(); ++i)
    {
        numa_node_to_cpus(i, bm);
        std::cout << "numa node " << i << " " << *bm << " "
                  << numa_node_size(i, 0) << std::endl;
    }
    numa_bitmask_free(bm);

    void* x;
    size_t N(10000000);
    size_t M(50);

    boost::thread t1(boost::bind(&thread1, &x, 0, N, M));
    t1.join();

    for (int i = 0; i < numcpus; ++i)
    {
        boost::thread t2(boost::bind(&thread2, x, i, N, M));
        t2.join();
    }

    numa_free(x, N);

    return 0;
}

The Output

g++ -o numatest -pthread -lboost_thread -lnuma -O0 numatest.cpp
./numatest

numa_available() 0                    <-- NUMA is available on this system
numa node 0 10101010 12884901888      <-- cores 0,2,4,6 are on NUMA node 0, which has about 12 GB
numa node 1 01010101 12874584064      <-- cores 1,3,5,7 are on NUMA node 1, which is slightly smaller than node 0

Elapsed read/write by same thread that allocated on core 0: 00:00:01.767428
Elapsed read/write by thread on core 0: 00:00:01.760554
Elapsed read/write by thread on core 1: 00:00:01.719686
Elapsed read/write by thread on core 2: 00:00:01.708830
Elapsed read/write by thread on core 3: 00:00:01.691560
Elapsed read/write by thread on core 4: 00:00:01.686912
Elapsed read/write by thread on core 5: 00:00:01.691917
Elapsed read/write by thread on core 6: 00:00:01.686509
Elapsed read/write by thread on core 7: 00:00:01.689928

Doing 50 iterations reading and writing over array x takes about 1.7 seconds, no matter which core is doing the reading and writing.

Update:

The cache size on my CPUs is 8 MB, so maybe a 10 MB array x is not big enough to eliminate cache effects. I tried a 100 MB array x, and I've tried issuing a full memory fence with __sync_synchronize() inside my innermost loops. It still doesn't reveal any asymmetry between NUMA nodes.

Update 2:

I've tried reading and writing to array x with __sync_fetch_and_add(). Still nothing.

asked Aug 31 '11 by James Brock


2 Answers

The first thing I want to point out is that you might want to double-check which cores are on each node. I don't recall cores and nodes being interleaved like that. Also, you should have 16 threads due to Hyper-Threading (unless you have disabled it).

Another thing:

The socket 1366 Xeon machines are only slightly NUMA. So it will be hard to see the difference. The NUMA effect is much more noticeable on the 4P Opterons.

On systems like yours, the node-to-node bandwidth is actually faster than the CPU-to-memory bandwidth. Since your access pattern is completely sequential, you are getting the full bandwidth regardless of whether or not the data is local. A better thing to measure is the latency: try randomly accessing a block of 1 GB instead of streaming through it sequentially.
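As a sketch of what measuring latency rather than bandwidth could look like (hypothetical code, not from either poster): chasing a randomly shuffled chain of indices makes every load depend on the previous one, so neither the prefetcher nor memory-level parallelism can hide the node distance.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Build a random single-cycle permutation: next[i] is the index visited
// after i. Each load then depends on the previous one, so the hardware
// prefetcher cannot run ahead and the walk time is dominated by latency.
std::vector<size_t> make_chain(size_t n)
{
    std::vector<size_t> order(n);
    std::iota(order.begin(), order.end(), 0);
    std::mt19937_64 rng(42);
    std::shuffle(order.begin() + 1, order.end(), rng);
    std::vector<size_t> next(n);
    for (size_t i = 0; i + 1 < n; ++i)
        next[order[i]] = order[i + 1];
    next[order[n - 1]] = order[0];   // close the cycle
    return next;
}

// Walk the chain for `steps` dependent loads; returns nanoseconds per load.
double chase_ns_per_load(const std::vector<size_t>& next, size_t steps)
{
    size_t p = 0;
    auto t1 = std::chrono::steady_clock::now();
    for (size_t i = 0; i < steps; ++i)
        p = next[p];
    auto t2 = std::chrono::steady_clock::now();
    volatile size_t sink = p;        // keep the loop from being optimized out
    (void)sink;
    return std::chrono::duration<double, std::nano>(t2 - t1).count() / steps;
}
```

Allocating `next` with numa_alloc_local or numa_alloc_onnode and running `chase_ns_per_load` from cores on each node should expose the latency gap far more clearly than a sequential sweep does.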

Last thing:

Depending on how aggressively your compiler optimizes, your loop might be optimized out since it doesn't do anything:

c = ((char*)x)[j];
((char*)x)[j] = c;

Something like this will guarantee that it won't be eliminated by the compiler:

((char*)x)[j] += 1; 
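A volatile-qualified pointer is another common way to keep the accesses from being eliminated (my addition, not part of the answer): the compiler must emit every load and store, without changing the access pattern.

```cpp
#include <cstddef>

// Accesses through a volatile-qualified pointer may not be elided, so this
// read-modify-write pass survives even at -O2.
void touch_all(void* x, size_t n)
{
    volatile char* p = static_cast<volatile char*>(x);
    for (size_t j = 0; j < n; ++j)
        p[j] = p[j] + 1;             // load and store that cannot be removed
}
```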
answered Oct 08 '22 by Mysticial


Ah hah! Mysticial is right! Somehow, hardware pre-fetching is optimizing my read/writes.

If it were a cache optimization, then forcing a memory barrier would defeat the optimization:

c = __sync_fetch_and_add(((char*)x) + j, 1); 

but that doesn't make any difference. What does make a difference is multiplying my iterator index by the prime 1009 to defeat the hardware prefetching:

*(((char*)x) + ((j * 1009) % N)) += 1; 

With that change, the NUMA asymmetry is clearly revealed:

numa_available() 0
numa node 0 10101010 12884901888
numa node 1 01010101 12874584064

Elapsed read/write by same thread that allocated on core 0: 00:00:00.961725
Elapsed read/write by thread on core 0: 00:00:00.942300
Elapsed read/write by thread on core 1: 00:00:01.216286
Elapsed read/write by thread on core 2: 00:00:00.909353
Elapsed read/write by thread on core 3: 00:00:01.218935
Elapsed read/write by thread on core 4: 00:00:00.898107
Elapsed read/write by thread on core 5: 00:00:01.211413
Elapsed read/write by thread on core 6: 00:00:00.898021
Elapsed read/write by thread on core 7: 00:00:01.207114

At least I think that's what's going on.
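One way to convince yourself the stride trick still touches every byte (a small self-check I'm adding, not from the original post): since 1009 is prime and does not divide N, j -> (j * 1009) % N is a bijection on [0, N), so each pass still performs N loads and N stores, just in an order the prefetcher cannot predict.

```cpp
#include <cstddef>
#include <vector>

// Returns true if j -> (j * stride) % n visits every index exactly once.
// That holds whenever gcd(stride, n) == 1, e.g. stride = 1009 (prime) and
// n = 10,000,000 as in the benchmark above.
bool stride_visits_all(size_t n, size_t stride)
{
    std::vector<bool> seen(n, false);
    for (size_t j = 0; j < n; ++j)
        seen[(j * stride) % n] = true;
    for (bool b : seen)
        if (!b)
            return false;
    return true;
}
```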

Thanks Mysticial!

EDIT: CONCLUSION ~133%

For anyone who is just glancing at this post to get a rough idea of the performance characteristics of NUMA, here is the bottom line according to my tests:

Memory access to a non-local NUMA node has about 1.33 times the latency of memory access to a local node (roughly 1.21 s vs 0.90 s in the runs above).

answered Oct 08 '22 by James Brock