Hard page faults: The best indicator of a memory bottleneck is a sustained, high rate of hard page faults. Hard page faults occur when the data that a program requires is not found in its working set (the physical memory visible to the program) or elsewhere in physical memory, and must be retrieved from disk.
Abstract: As the speed gap between CPU and memory widens, memory hierarchy has become the primary factor limiting program performance. Until now, the principal focus of hardware and software innovations has been overcoming latency.
The theoretical maximum memory bandwidth for Intel Core X-Series Processors can be calculated by multiplying the memory transfer rate (twice the base clock, since the memory is double data rate) by the bus width in bytes and by the number of memory channels the processor supports.
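For example, here is that calculation worked through in a small sketch; the DDR4-2666 transfer rate, 8-byte bus width and quad-channel count are assumed example values for a Core X-Series part, not figures taken from the text above.

```cpp
#include <cstdio>

int main() {
    // Example values (assumptions, not from the text above):
    // DDR4-2666 -> 2666 mega-transfers/s, 64-bit (8-byte) bus, 4 channels.
    const double transfers_per_sec = 2666e6;  // already includes the DDR doubling
    const double bytes_per_transfer = 8.0;
    const double channels = 4.0;

    const double peak_bw = transfers_per_sec * bytes_per_transfer * channels;
    std::printf("Theoretical peak: %.1f GB/s\n", peak_bw / 1e9);  // ~85.3 GB/s
    return 0;
}
```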
A memory bottleneck is a memory shortage caused by insufficient capacity, memory leaks, defective programs, or slow memory paired with a fast processor. It degrades the machine's performance by slowing the movement of data between the CPU and the RAM.
I have had this problem myself on a NUMA machine with 96x8 cores.
90% of the time the problem is with memory/cache synchronisation. If you call synchronisation routines frequently (atomics, mutexes) then the appropriate cache line has to be invalidated on all sockets leading to a complete lockdown of the entire memory bus for multiple cycles.
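To make that effect concrete, here is a minimal, hypothetical demonstration (not code from the answer): every thread hammers one shared std::atomic, so the owning cache line bounces between cores and sockets on each increment. Timing it on one socket versus several should show the kind of slowdown described above; per-thread private counters avoid the traffic entirely.

```cpp
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    const int n_threads = 8;          // assumption: adjust to your machine
    const long iters = 10'000'000;
    std::atomic<long> shared{0};

    auto t0 = std::chrono::steady_clock::now();
    std::vector<std::thread> pool;
    for (int t = 0; t < n_threads; ++t)
        pool.emplace_back([&] {
            for (long i = 0; i < iters; ++i)
                shared.fetch_add(1, std::memory_order_relaxed);  // cache line ping-pongs
        });
    for (auto& th : pool) th.join();
    auto t1 = std::chrono::steady_clock::now();

    std::printf("contended: %ld increments in %.2f s\n", shared.load(),
                std::chrono::duration<double>(t1 - t0).count());
    return 0;
}
```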
You can profile this by running a profiler like Intel VTune or PerfSuite and having it record how long your atomics take. If you are using them properly, they should take somewhere between 10-40 cycles. The worst case I saw was 300 cycles, when scaling my multithreaded application to 8 sockets (8x8 cores on Intel Xeon).
Another easy profiling step you can take is to compile without any atomics/mutexes (if your code permits it) and then run it on multiple sockets - it should run fast (incorrectly, but fast).
The reason your code runs fast on 8 cores is that Intel processors use cache locking when executing atomics, as long as everything stays on the same physical chip (socket). Once a lock has to go out over the memory bus, things get ugly.
The only thing I can suggest is to cut down on how often you call atomics/synchronisation routines.
As for my application: I had to implement a virtually lock-free data structure in order to scale my code beyond one socket. Every thread accumulates actions that require a lock and regularly checks whether it is its turn to flush them. The threads then pass a token around and take turns flushing their synchronisation actions. Obviously this only works if you have sufficient work to do while waiting.
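A rough sketch of that token-passing idea, under my own assumptions about the structure (the type names and the std::function-based "action" are purely illustrative, not the author's implementation): each thread buffers its lock-requiring work locally and only flushes when an atomic token says it is its turn.

```cpp
#include <atomic>
#include <functional>
#include <vector>

// Hypothetical sketch: each worker buffers "actions" that would otherwise need
// a lock and flushes them only when it holds the token, so threads take turns.
struct TokenFlusher {
    explicit TokenFlusher(int n_threads) : token(0), n(n_threads) {}

    // Called by thread `id` from its main loop; cheap when it is not our turn.
    void maybe_flush(int id, std::vector<std::function<void()>>& local_actions) {
        if (token.load(std::memory_order_acquire) != id) return;  // not our turn
        for (auto& act : local_actions) act();   // run the buffered synchronised work
        local_actions.clear();
        token.store((id + 1) % n, std::memory_order_release);     // pass the token on
    }

    std::atomic<int> token;
    int n;
};
```

Each worker would call maybe_flush(id, my_actions) periodically inside its compute loop; as the answer says, the scheme only pays off if there is enough independent work to do while waiting for the token.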
+1 for good question.
First I want to say there are other factors to consider, e.g. cache synchronization, or unavoidable serial parts such as atomic memory operations, which are also possible bottlenecks and are easier to verify than memory bandwidth.
As for memory bandwidth, what I have now is a naive idea: launch a simple daemon that consumes memory bandwidth while you profile your application, by repeatedly accessing main memory (be sure to account for the existence of the cache). With the daemon you can adjust and log how much memory bandwidth it consumes and compare that result with the performance of your application.
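A minimal sketch of such a "bandwidth eater", assuming a Linux/desktop box where a 1 GiB buffer comfortably exceeds the last-level cache; the buffer size would be a knob you tune while watching your application's performance, and you could run several copies to consume more bandwidth.

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    // Assumed to be much larger than the last-level cache so reads hit DRAM.
    const size_t n = size_t(1) << 27;            // 128M doubles = 1 GiB
    std::vector<double> buf(n, 1.0);

    volatile double sink = 0;                    // keep the loop from being optimised away
    for (;;) {                                   // run until killed
        auto t0 = std::chrono::steady_clock::now();
        double sum = 0;
        for (size_t i = 0; i < n; ++i) sum += buf[i];   // stream through memory
        sink = sum;
        auto secs = std::chrono::duration<double>(
                        std::chrono::steady_clock::now() - t0).count();
        std::printf("consuming ~%.1f GB/s\n", (n * sizeof(double)) / secs / 1e9);
    }
}
```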
Sorry for providing such a sloppy answer.. although it's doable XD
EDITED: Also see How to measure memory bandwidth currently being used on Linux? and How can I observe memory bandwidth?
While it would be helpful to have more information on the algorithm and the platform, there are in general a number of reasons why an application does not scale:
Use of explicit synchronization (mutexes/atomics/transactions etc.): synchronization in a parallel program creates sequential sections wherever a resource has to be shared between multiple threads. The more threads want to enter the critical section (an atomic operation is really a very small critical section), the more contention you have and the more your scalability is limited, since the cores take turns entering the critical section. Reducing the size of the critical sections and choosing different data structures/algorithms can mitigate that if privatizing the resource is not possible.
False sharing: two or more threads share unrelated objects that happen to end up in the same cache block. It is normally easy to detect by seeing increased cache misses as you scale your application from one core to more, and from one socket to more than one socket. Aligning your data structures to the cache block size normally solves it (a minimal padding sketch follows after this list). See also Eliminate False Sharing - Dr Dobb's.
Memory allocation/deallocation: while memory allocation hands out chunks for different threads to work on, you might have contention at allocation or even at deallocation. This can be solved by using a scalable thread-safe memory allocator such as Intel TBB's scalable allocator, Hoard and others (see the allocator sketch after this list).
Idle threads: does your algorithm follow a producer/consumer pattern, and could it be that you consume faster than you produce? Is your data size big enough to amortize the cost of parallelization, so that you do not lose speed by losing locality? Is your algorithm inherently unscalable for some other reason? You probably have to tell us more about your platform and your algorithm. Intel Advisor is a decent tool for checking the best way to parallelize.
Parallel framework: what are you using? OpenMP, Intel TBB, something else? Plain threads? Do you maybe fork/join too much or over-partition your problem? Is your runtime itself scalable?
Other technical reasons: incorrect binding of threads to cores (maybe multiple threads end up on the same core), features of the parallel runtime (Intel's OpenMP runtime has an additional hidden thread; binding threads to cores can map this extra thread onto the same core as the main thread, ruining your day), etc. A pinning sketch follows below.
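For the false-sharing point above, a minimal sketch of the usual fix: padding/aligning each thread's hot counter to a cache line. The 64-byte line size and the thread count are assumptions; compile as C++17 or later so the over-aligned element type is allocated correctly.

```cpp
#include <cstdio>
#include <thread>
#include <vector>

// Without alignas(64), neighbouring counters can share one cache line and the
// threads invalidate each other's copies on every write (false sharing).
struct alignas(64) PaddedCounter {
    long value = 0;
};

int main() {
    const int n_threads = 4;                       // assumption
    std::vector<PaddedCounter> counters(n_threads);

    std::vector<std::thread> pool;
    for (int t = 0; t < n_threads; ++t)
        pool.emplace_back([&, t] {
            for (long i = 0; i < 50'000'000; ++i) counters[t].value++;  // private line
        });
    for (auto& th : pool) th.join();

    for (auto& c : counters) std::printf("%ld\n", c.value);
    return 0;
}
```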
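For the allocator point, a sketch of what swapping in a scalable allocator can look like, assuming oneTBB is installed (the allocator is provided by the tbbmalloc library, so link against it); the container and sizes here are arbitrary.

```cpp
#include <tbb/scalable_allocator.h>
#include <vector>

int main() {
    // Allocations for this container go through TBB's scalable allocator
    // instead of the (potentially contended) global malloc.
    std::vector<int, tbb::scalable_allocator<int>> v;
    for (int i = 0; i < 1'000'000; ++i) v.push_back(i);
    return static_cast<int>(v.size() & 1);  // use the result so it isn't optimised out
}
```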
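And for the binding point, a Linux-specific sketch of pinning each thread to its own core with pthread_setaffinity_np. The naive 1:1 thread-to-core mapping is an assumption; on a real machine you would map it to the actual topology, or simply use the runtime's own affinity controls (e.g. OMP_PROC_BIND or KMP_AFFINITY).

```cpp
#include <pthread.h>
#include <sched.h>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    const unsigned n = std::thread::hardware_concurrency();
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < n; ++t) {
        pool.emplace_back([t] { /* ... worker loop ... */ });
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(t, &set);                      // naive 1:1 thread -> core mapping
        if (pthread_setaffinity_np(pool.back().native_handle(),
                                   sizeof(set), &set) != 0)
            std::fprintf(stderr, "failed to pin thread %u\n", t);
    }
    for (auto& th : pool) th.join();
    return 0;
}
```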
From my experience, once you have eliminated all of the above, you can start suspecting memory bandwidth. You can check it easily with the STREAM benchmark, which can tell you whether memory bandwidth is the limiting factor. There is an article on Intel's website that explains how to detect memory bandwidth saturation.
If none of the above is conclusive, your scalability may actually be limited by coherence protocol traffic and/or NUMA (Non-Uniform Memory Access; there is a good article on it in acmqueue). Whenever you access an object in memory, you either generate cache invalidation requests (you are sharing something and the cache coherence protocol kicks in) or you access memory that lives in a bank closer to another socket (you go through the processor interconnect).
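One common NUMA mitigation worth sketching, which is my addition rather than something stated above: on Linux, memory is usually placed on the NUMA node of the thread that first touches it, so initialising data with the same threads (and the same schedule) that will later use it keeps accesses node-local. The OpenMP loop below is a minimal illustration of that first-touch idea (compile with -fopenmp); the array size is arbitrary.

```cpp
#include <memory>

int main() {
    const long long n = 1LL << 27;                     // ~1 GiB of doubles (assumption)
    std::unique_ptr<double[]> data(new double[n]);     // allocated but not yet touched

    // First-touch placement: each thread initialises the chunk it will later
    // work on, so those pages land on that thread's local NUMA node.
    #pragma omp parallel for schedule(static)
    for (long long i = 0; i < n; ++i)
        data[i] = 0.0;

    // Later processing with the same static schedule keeps accesses node-local.
    #pragma omp parallel for schedule(static)
    for (long long i = 0; i < n; ++i)
        data[i] = data[i] * 2.0 + 1.0;

    return 0;
}
```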