We are testing our software for the first time on a machine with > 12 cores for scalability and we are encountering a nasty drop in performance after the 12th thread is added. After spending a couple days on this, we are stumped regarding what to try next.
The test system is a dual Opteron 6174 (2x12 cores) with 16 GB of memory, Windows Server 2008 R2.
Basically, performance peaks from 10 - 12 threads, then drops off a cliff and is soon performing work at about the same rate it was with about 4 threads. The drop-off is fairly steep and by 16 - 20 threads it reaches bottom in terms of throughput. We have tested both with a single process running multiple threads and as multiple processes running single threads-- the results are pretty much the same. The processing is fairly memory intensive and somewhat disk intensive.
We are fairly certain this is a memory bottleneck, but we don't believe it a cache issue. The evidence is as follows:
We thought this might be evidence of the problem described in this paper, however we found that pinning each worker thread/process to a particular core didn't improve the results at all (if anything, performance got a little worse).
So that's where we're at. Any ideas on the precise cause of this bottleneck or how we might avoid it?
I'm not sure that I understand the issues completely such that I can offer you a solution but from what you've explained I may have some alternative view points which may be of help.
I program in C so what works for me may not be applicable in your case.
Your processors have 12MB of L3 and 6MB of L2 which is big but in my view they're seldom big enough!
You're probably using rdtsc for timing individual sections. When I use it I have a statistics structure into which I send the measurement results from different parts of the executing code. Average, minimum, maximum and number of observations are obvious but also standard deviation has its place in that it can help you decide whether a large maximum value should be researched or not. Standard deviation only needs to be calculated when it needs to be read out: until then it can be stored in its components (n, sum x, sum x^2). Unless you're timing very short sequences you can omit the preceding synchronizing instruction. Make sure you quantifiy the timing overhead, if only to be able to rule it out as insignificant.
When I program multi-threaded I try to make each core's/thread's task as "memory limited" as possible. By memory limited I mean not doing things which requires unnecessary memory access. Unnecessary memory access usually means as much inline code as possible and as litte OS access as possible. To me the OS is a great unknown in terms of how much memory work a call to it will generate so I try to keep calls to it to a minimum. In the same manner but usually to a lesser performance impacting extent I try to avoid calling application functions: if they must be called I'd rather they didn't call a lot of other stuff.
In the same manner I minimize memory allocations: if I need several I add them together into one and then subdivide that one big allocation into smaller ones. This will help later allocations in that they will need to loop through fewer blocks before finding the block returned. I only block initialize when absolutely necessary.
I also try to reduce code size by inlining. When moving/setting small blocks of memory I prefer using intrinsics based on rep movsb and rep stosb rather than calling memcopy/memset which are usually both optimized for larger blocks and not especially limited in size.
I've only recently begun using spinlocks but I implement them such that they become inline (anything is better than calling the OS!). I guess the OS alternative is critical sections and though they are fast local spinlocks are faster. Since they perform additional processing it means that they prevent application processing from being performed during that time. This is the implementation:
inline void spinlock_init (SPINLOCK *slp)
{
slp->lock_part=0;
}
inline char spinlock_failed (SPINLOCK *slp)
{
return (char) __xchg (&slp->lock_part,1);
}
Or more elaborate (but not overly so):
inline char spinlock_failed (SPINLOCK *slp)
{
if (__xchg (&slp->lock_part,1)==1) return 1;
slp->count_part=1;
return 0;
}
And to release
inline void spinlock_leave (SPINLOCK *slp)
{
slp->lock_part=0;
}
Or
inline void spinlock_leave (SPINLOCK *slp)
{
if (slp->count_part==0) __breakpoint ();
if (--slp->count_part==0) slp->lock_part=0;
}
The count part is something I've brought along from embedded (and other programming) where it is used for handling nested interrupts.
I'm also a big fan of IOCPs for their efficiency in handling IO events and threads but your description does not indicate whether your application could use them. In any case you appear to economize on them, which is good.
To address your bullet points:
1) If you have 12 cores at 100% usage and 12 cores idle, then your total CPU usage would be 50%. If your synchronization is spinlock-esque, then your threads would still be saturating their CPUs even while not accomplishing useful work.
2) skipped
3) I agree with your conclusion. In the future, you should know that Perfmon has a counter: Process\Page Faults/sec that can verify this.
4) If you don't have the private symbols for ntoskrnl, CodeAnalyst may not be able to tell you the correct function names in its profile. Rather, it can only point to the nearest function for which it has symbols. Can you get stack traces with the profiles using CodeAnalyst? This could help you determine what operation your threads perform that drives the kernel usage.
Also, my former team at Microsoft has provided a number of tools and guidelines for performance analysis here, including taking stack traces on CPU profiles.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With