
Severe multi-threaded memory bottleneck after reaching a specific number of cores

We are testing our software for the first time on a machine with more than 12 cores for scalability, and we are encountering a nasty drop in performance after the 12th thread is added. After spending a couple of days on this, we are stumped about what to try next.

The test system is a dual Opteron 6174 (2x12 cores) with 16 GB of memory, Windows Server 2008 R2.

Basically, performance peaks from 10 - 12 threads, then drops off a cliff and is soon performing work at about the same rate it was with about 4 threads. The drop-off is fairly steep, and by 16 - 20 threads it reaches bottom in terms of throughput. We have tested both with a single process running multiple threads and with multiple processes running single threads; the results are pretty much the same. The processing is fairly memory intensive and somewhat disk intensive.

We are fairly certain this is a memory bottleneck, but we don't believe it is a cache issue. The evidence is as follows:

  • CPU usage continues to climb from 50 to 100% when scaling from 12 to 24 threads. If we were having synchronization/deadlock issues, we would have expected CPU usage to top out before reaching 100%.
  • Testing while copying a large amount of files in the background had very little impact on the processing rates. We think this rules out disk I/O as the bottleneck.
  • The commit charge is only about 4 GB, so we should be well below the threshold at which paging would become an issue.
  • The best data comes from using AMD's CodeAnalyst tool. CodeAnalyst shows the Windows kernel goes from taking about 6% of the CPU time with 12 threads to 80-90% of CPU time when using 24 threads. A vast majority of that time is spent in the ExAcquireResourceSharedLite (50%) and KeAcquireInStackQueuedSpinLockAtDpcLevel (46%) functions. Here are the highlights of the kernel's factor change when going from running with 12 threads to running with 24:

    Instructions: 5.56 (times more)
    Clock cycles: 10.39
    Memory operations: 4.58
    Cache miss ratio: 0.25 (actual cache miss ratio is 0.1, 4 times smaller than with 12 threads)
    Avg cache miss latency: 8.92
    Total cache miss latency: 6.69
    Mem bank load conflict: 11.32
    Mem bank store conflict: 2.73
    Mem forwarded: 7.42

We thought this might be evidence of the problem described in this paper, however we found that pinning each worker thread/process to a particular core didn't improve the results at all (if anything, performance got a little worse).

So that's where we're at. Any ideas on the precise cause of this bottleneck or how we might avoid it?

asked Oct 25 '22 by Alias

2 Answers

I'm not sure I understand the issues well enough to offer you a solution, but from what you've explained I may have some alternative viewpoints which may be of help.

I program in C, so what works for me may not be applicable in your case.

Your processors have 12MB of L3 and 6MB of L2 cache, which is big, but in my view caches are seldom big enough!

You're probably using rdtsc for timing individual sections. When I use it, I have a statistics structure into which I send the measurement results from different parts of the executing code. Average, minimum, maximum and number of observations are obvious, but standard deviation also has its place: it can help you decide whether a large maximum value should be investigated or not. Standard deviation only needs to be calculated when it is read out; until then it can be stored in its components (n, sum x, sum x^2). Unless you're timing very short sequences you can omit the preceding serializing instruction. Make sure you quantify the timing overhead, if only to be able to rule it out as insignificant.

When I program multi-threaded I try to make each core's/thread's task as "memory limited" as possible. By memory limited I mean not doing things that require unnecessary memory access. In practice that means as much inline code as possible and as little OS access as possible. To me the OS is a great unknown in terms of how much memory work a call to it will generate, so I keep calls to it to a minimum. In the same manner, though usually with a smaller performance impact, I try to avoid calling application functions: if they must be called, I'd rather they didn't call a lot of other stuff in turn.

In the same manner I minimize memory allocations: if I need several, I combine them into one large allocation and then subdivide that one big allocation into the smaller ones myself. This also helps later heap allocations, since the allocator has fewer blocks to loop through before finding the block to return. I only initialize blocks when absolutely necessary.
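A sketch of that one-big-allocation approach as a simple bump arena (the names and the 16-byte alignment choice are mine, for illustration only):

```c
#include <stddef.h>
#include <stdlib.h>

/* One large allocation subdivided by a bump pointer: each further
   "allocation" is just a pointer increment, not a heap walk. */
typedef struct {
  char  *base;
  size_t used, size;
} ARENA;

static int arena_init (ARENA *a, size_t size)
{
  a->base = (char *) malloc (size);   /* the single big allocation */
  a->used = 0; a->size = size;
  return a->base != NULL;
}

static void *arena_alloc (ARENA *a, size_t n)
{
  n = (n + 15u) & ~(size_t) 15u;      /* keep 16-byte alignment */
  if (a->used + n > a->size) return NULL;
  void *p = a->base + a->used;
  a->used += n;
  return p;
}

static void arena_free_all (ARENA *a) /* everything released at once */
{
  free (a->base);
  a->base = NULL; a->used = a->size = 0;
}
```

The trade-off is that the sub-blocks cannot be freed individually, which fits workloads where a batch of related buffers lives and dies together.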

I also try to reduce code size by inlining. When moving or setting small blocks of memory I prefer intrinsics based on rep movsb and rep stosb over calling memcpy/memset, which are usually optimized for larger blocks and carry call overhead.
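One way such intrinsics might look: MSVC exposes rep movsb/rep stosb directly as __movsb/__stosb, while on GCC/Clang (x86) the same instructions can be reached with inline asm. The wrapper names are my own:

```c
#include <stddef.h>

#if defined(_MSC_VER)
#include <intrin.h>
/* MSVC: __movsb/__stosb compile to rep movsb / rep stosb */
#define REP_MOVSB(d,s,n) __movsb ((unsigned char *)(d), (const unsigned char *)(s), (n))
#define REP_STOSB(d,v,n) __stosb ((unsigned char *)(d), (unsigned char)(v), (n))
#else
/* GCC/Clang on x86: same instructions via extended inline asm.
   rep movsb copies rcx bytes from [rsi] to [rdi]. */
static inline void REP_MOVSB (void *d, const void *s, size_t n)
{
  __asm__ __volatile__ ("rep movsb"
                        : "+D"(d), "+S"(s), "+c"(n) :: "memory");
}
/* rep stosb stores al into rcx bytes at [rdi]. */
static inline void REP_STOSB (void *d, unsigned char v, size_t n)
{
  __asm__ __volatile__ ("rep stosb"
                        : "+D"(d), "+c"(n) : "a"(v) : "memory");
}
#endif
```

On recent CPUs with fast string operations these can be competitive even for medium-sized blocks, but that is worth measuring on the target machine rather than assuming.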

I've only recently begun using spinlocks, but I implement them so that they inline (anything is better than calling the OS!). The OS alternative is critical sections, and though those are fast, local spinlocks are faster. Bear in mind that a spinning thread keeps busy-waiting on its core, so it prevents application processing from being performed there during that time. This is the implementation:

inline void spinlock_init (SPINLOCK *slp)
{
  slp->lock_part=0;                       /* 0 = unlocked */
}

inline char spinlock_failed (SPINLOCK *slp)
{
  /* __xchg is an atomic exchange (e.g. _InterlockedExchange on MSVC);
     returns non-zero if the lock was already held */
  return (char) __xchg (&slp->lock_part,1);
}

Or more elaborate (but not overly so):

inline char spinlock_failed (SPINLOCK *slp)
{
  if (__xchg (&slp->lock_part,1)==1) return 1;  /* already held: fail */
  slp->count_part=1;                            /* outermost acquisition */
  return 0;
}

And to release

inline void spinlock_leave (SPINLOCK *slp)
{
  slp->lock_part=0;                       /* unlock */
}

Or

inline void spinlock_leave (SPINLOCK *slp)
{
  if (slp->count_part==0) __breakpoint ();      /* releasing an unheld lock */
  if (--slp->count_part==0) slp->lock_part=0;   /* unlock at outermost level */
}

The count part is something I've brought along from embedded (and other programming) where it is used for handling nested interrupts.
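The same pattern can be rendered in portable C11 atomics, with __xchg replaced by atomic_exchange_explicit. This is a sketch under my own assumptions about the SPINLOCK layout, not the author's actual code:

```c
#include <stdatomic.h>

/* C11 rendering of the pattern above: lock_part is the atomic
   exchange target, count_part supports the nested variant. */
typedef struct {
  atomic_int lock_part;
  int        count_part;
} SPINLOCK;

static void spinlock_init (SPINLOCK *slp)
{
  atomic_init (&slp->lock_part, 0);
  slp->count_part = 0;
}

static int spinlock_failed (SPINLOCK *slp)
{
  /* atomic exchange: returns the previous value, so non-zero
     means the lock was already held */
  return atomic_exchange_explicit (&slp->lock_part, 1,
                                   memory_order_acquire);
}

static void spinlock_leave (SPINLOCK *slp)
{
  atomic_store_explicit (&slp->lock_part, 0, memory_order_release);
}

/* Typical call site: spin until acquired; on x86 a pause
   instruction in the loop body eases pressure on the bus. */
static void spinlock_enter (SPINLOCK *slp)
{
  while (spinlock_failed (slp))
    ;
}
```

The acquire/release memory orders mirror what the lock must guarantee: writes made inside the critical section become visible before the lock is observed as free.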

I'm also a big fan of IOCPs (I/O completion ports) for their efficiency in handling I/O events and threads, but your description does not indicate whether your application could use them. In any case you appear to economize on them, which is good.

answered Nov 20 '22 by Olof Forshell


To address your bullet points:

1) If you have 12 cores at 100% usage and 12 cores idle, then your total CPU usage would be 50%. If your synchronization is spinlock-like, then your threads would still be saturating their CPUs even while not accomplishing useful work.

2) skipped

3) I agree with your conclusion. In the future, you should know that Perfmon has a counter: Process\Page Faults/sec that can verify this.

4) If you don't have the private symbols for ntoskrnl, CodeAnalyst may not be able to tell you the correct function names in its profile. Rather, it can only point to the nearest function for which it has symbols. Can you get stack traces with the profiles using CodeAnalyst? This could help you determine what operation your threads perform that drives the kernel usage.

Also, my former team at Microsoft has provided a number of tools and guidelines for performance analysis here, including taking stack traces on CPU profiles.

answered Nov 20 '22 by Brian