 

Looking for an explanation for thread synchronization performance issue

When using kernel objects to synchronize threads running on different CPUs, is there perhaps some extra runtime cost when using Windows Server 2008 R2 relative to other OS's?

Edit: And as found out via the answer, the question should also include the phrase, "when running at lower CPU utilization levels." I included more information in my own answer to this question.

Background

I work on a product that uses shared memory and semaphores for communication between processes (when the two processes are running on the same machine). Reports of performance problems on Windows Server 2008 R2 (which I shorten to Win2008R2 after this) led me to find that sharing a semaphore between two threads on Win2008R2 was relatively slow compared to other OS’s.

Reproducing it

I was able to reproduce it by running the following bit of code concurrently on two threads:

for ( i = 0; i < N; i++ )
  {
  WaitForSingleObject( globalSem, INFINITE );
  ReleaseSemaphore( globalSem, 1, NULL );
  }

Testing with a machine that would dual boot into Windows Server 2003 R2 SP2 and Windows Server 2008 R2, the above snippet would run about 7 times faster on the Win2003R2 machine versus the Win2008R2 (3 seconds for Win2003R2 and 21 seconds for Win2008R2).

Simple Version of the Test

The following is the complete program for the simple test described above:

#include <windows.h>
#include <stdio.h>
#include <time.h>


HANDLE gSema4;
int    gIterations = 1000000;

DWORD WINAPI testthread( LPVOID tn )
{
   int count = gIterations;

   /* Both threads hammer the same global semaphore: acquire, then
      immediately release, for gIterations iterations. */
   while ( count-- )
      {
      WaitForSingleObject( gSema4, INFINITE );
      ReleaseSemaphore( gSema4, 1, NULL );
      }

   return 0;
}


int main( int argc, char* argv[] )
{
   DWORD    threadId;
   clock_t  ct;
   HANDLE   threads[2];

   /* Binary semaphore (initial and maximum count of 1) shared by both threads. */
   gSema4 = CreateSemaphore( NULL, 1, 1, NULL );

   ct = clock();
   threads[0] = CreateThread( NULL, 0, testthread, NULL, 0, &threadId );
   threads[1] = CreateThread( NULL, 0, testthread, NULL, 0, &threadId );

   WaitForMultipleObjects( 2, threads, TRUE, INFINITE );

   /* With MSVC, CLOCKS_PER_SEC is 1000, so this is elapsed milliseconds. */
   printf( "Total time = %ld\n", (long)( clock() - ct ) );

   CloseHandle( threads[0] );
   CloseHandle( threads[1] );
   CloseHandle( gSema4 );
   return 0;
}
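
For reference, this is a plain Win32 console program; something like "cl test.c" from a Visual Studio command prompt (assuming the file is saved as test.c) should be enough to build it, since it only calls kernel32 functions that are linked by default.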

More Details

I updated the test so that each thread runs a single iteration and then forces a switch to the other thread on each pass through the loop. Each thread signals the next thread to run at the end of each loop iteration (round-robin style). I also updated it to use a spinlock as an alternative to the semaphore (which is a kernel object).
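
As an illustration, here is a minimal sketch of that round-robin handoff (the names and structure are made up for the example and reuse gIterations from the test above; this is not the actual enhanced test code). Each thread waits on its own semaphore and releases the other thread's, so ownership alternates on every iteration and every acquire pays for a full kernel wait plus a thread switch:

HANDLE gHandoff[2];    /* one semaphore per thread; gHandoff[0] is created
                          with an initial count of 1 so thread 0 goes first,
                          gHandoff[1] starts at 0                            */

DWORD WINAPI pingpongthread( LPVOID tn )
{
   int self  = (int)(INT_PTR)tn;    /* 0 or 1, passed through CreateThread   */
   int other = 1 - self;
   int count = gIterations;

   while ( count-- )
      {
      WaitForSingleObject( gHandoff[self], INFINITE );   /* wait for my turn */
      ReleaseSemaphore( gHandoff[other], 1, NULL );      /* wake the peer    */
      }

   return 0;
}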

All machines I tested on were 64-bit machines. I compiled the test mostly as 32-bit. If built as 64-bit, it ran a bit faster overall and changed the ratios some, but the final result was the same. In addition to Win2008R2, I also ran against Windows 7 Enterprise SP 1, Windows Server 2003 R2 Standard SP 2, Windows Server 2008 (not R2), and Windows Server 2012 Standard.

  • Running the test on a single CPU was significantly faster ("forced" by setting thread affinity with SetThreadAffinityMask and verified with GetCurrentProcessorNumber). Not surprisingly, it was faster on all OS's when using a single CPU, but the ratio between the multi-CPU and single-CPU runs with kernel-object synchronization was much higher on Win2008R2. The typical ratio for all machines except Win2008R2 was 2x to 4x (running on multiple CPUs took 2 to 4 times longer). But on Win2008R2, the ratio was 9x.
  • However ... I was not able to reproduce the slowdown on all Win2008R2 machines. I tested on 4, and it showed up on 3 of them. So I cannot help but wonder if there is some kind of configuration setting or performance tuning option that might affect this. I have read performance tuning guides, looked through various settings, and changed various settings (e.g., background service vs. foreground app) with no difference in behavior.
  • It does not appear to be tied simply to switching between physical cores. I originally suspected that it was somehow tied to the cost of repeatedly accessing global data from different cores. But when running a version of the test that uses a simple spinlock for synchronization (not a kernel object), running the individual threads on different CPUs was reasonably fast on all OS types. The ratio of the multi-CPU semaphore sync test to the multi-CPU spinlock test was typically 10x to 15x. But for the Win2008R2 Standard Edition machines, the ratio was 30x. A sketch of the affinity and spinlock variations follows this list.
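
For reference, here is a rough sketch of those two variations: pinning a thread to a single CPU with SetThreadAffinityMask, and the user-mode spinlock handoff. The turn-passing scheme and names are assumptions for illustration, not the actual test code:

/* (1) Pin the calling thread to CPU 0 so both threads share one processor;
   GetCurrentProcessorNumber() can then be used to verify where it runs.     */
void pinToCpu0( void )
{
   SetThreadAffinityMask( GetCurrentThread(), 1 );    /* mask bit 0 = CPU 0  */
}

/* (2) Spinlock-style handoff: busy-wait in user mode on a shared "turn"
   variable instead of blocking on a kernel object.                          */
volatile LONG gTurn = 0;

DWORD WINAPI spinthread( LPVOID tn )
{
   LONG self  = (LONG)(INT_PTR)tn;    /* 0 or 1 */
   int  count = gIterations;

   while ( count-- )
      {
      while ( gTurn != self )                     /* spin until it is my turn       */
         YieldProcessor();                        /* pause hint; stays in user mode */
      InterlockedExchange( &gTurn, 1 - self );    /* pass the turn                  */
      }

   return 0;
}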

Here are some actual numbers from the updated test (times are in milliseconds):

+----------------+-----------+---------------+----------------+
|       OS       | 2 cpu sem |   1 cpu sem   | 2 cpu spinlock |
+----------------+-----------+---------------+----------------+
| Windows 7      | 7115 ms   | 1960 ms (3.6) | 504 ms (14.1)  |
| Server 2008 R2 | 20640 ms  | 2263 ms (9.1) | 866 ms (23.8)  |
| Server 2003    | 3570 ms   | 1766 ms (2.0) | 452 ms (7.9)   |
+----------------+-----------+---------------+----------------+

Each of the 2 threads in the test ran 1 million iterations. These tests were all run on identical machines. The Server 2008 R2 and Server 2003 numbers are from a dual-boot machine. The Win 7 machine has the exact same specs but was a different physical machine. The machine in this case is a Lenovo T420 laptop with a Core i5-2520M 2.5GHz. Obviously not a server-class machine, but I get similar results on true server-class hardware. The numbers in parentheses are the ratio of the first column to the given column.

Any explanation for why this one OS would seem to introduce extra expense for kernel level synchronization across CPUs? Or do you know of some configuration/tuning parameter that might affect this?

While it would make this already long post even longer, I can post the enhanced version of the test code that the above numbers came from if anyone wants it. It shows the enforcement of the round-robin logic and the spinlock version of the test.

Extended Background

To try to answer some of the inevitable questions about why things are done this way: I'm the same way ... when I read a post, I often wonder why the poster is doing it that way. So here are some attempts to clarify:

  • What is the application? It is a database server. In some situations, customers run the client application on the same machine as the server. In that case, it is faster to use shared memory for communication (versus sockets). This question is related to the shared memory communication.
  • Is the workload really that dependent on events? Well ... the shared memory communication is implemented using named semaphores. The client signals a semaphore, the server reads the data, and the server signals a semaphore for the client when the response is ready (a rough sketch of this handshake follows the list). On other platforms, it is blindingly fast; on Win2008R2, it is not. It is also very dependent on the customer application. If they write it with lots of small requests to the server, then there is a lot of communication between the two processes.
  • Can a lightweight lock be used? Possibly. I am already looking at that. But it is independent of the original question.
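
A highly simplified sketch of the client side of that handshake (the semaphore names and structure here are invented for illustration; this is not the product's actual code):

/* Client side of the shared-memory request/response handshake. reqSem and
   respSem are named semaphores shared with the server process.             */
void sendRequest( HANDLE reqSem, HANDLE respSem )
{
   /* ... copy the request into the shared-memory buffer ... */
   ReleaseSemaphore( reqSem, 1, NULL );         /* signal: request is ready    */
   WaitForSingleObject( respSem, INFINITE );    /* wait for the server's reply */
   /* ... read the response from the shared-memory buffer ... */
}

Each round trip therefore costs two kernel-object waits and two releases, which is why the per-operation overhead measured above translates directly into request latency when the client issues many small requests.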
Asked Jan 18 '13 by Mark Wilkins


1 Answer

Pulled from the comments into an answer:

Maybe the server is not set to the high-performance power plan? Win2k8 might have a different default. Many servers aren't set to it by default, and this hits performance very hard.

The OP confirmed this as the root cause.
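
(For anyone checking the same thing: the active plan can be inspected with powercfg /list and switched with powercfg /setactive <scheme GUID>, or changed to "High performance" in the Power Options control panel. This also fits the OP's note that the slowdown shows up at lower CPU utilization levels, presumably because a power-saving plan runs the CPU at reduced clock speeds when it is lightly loaded.)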

This is a funny cause for the behavior. The idea popped into my head while I was doing something completely different.

Answered by usr