When using kernel objects to synchronize threads running on different CPUs, is there perhaps some extra runtime cost when using Windows Server 2008 R2 relative to other OS's?
Edit: As the answer revealed, the question should also include the phrase, "when running at lower CPU utilization levels." I included more information in my own answer to this question.
I work on a product that uses shared memory and semaphores for communication between processes (when the two processes are running on the same machine). Reports of performance problems on Windows Server 2008 R2 (shortened to Win2008R2 below) led me to discover that sharing a semaphore between two threads on Win2008R2 was relatively slow compared to other OSes.
I was able to reproduce it by running the following bit of code concurrently on two threads:
for ( i = 0; i < N; i++ )
{
    WaitForSingleObject( globalSem, INFINITE );
    ReleaseSemaphore( globalSem, 1, NULL );
}
Testing with a machine that would dual boot into Windows Server 2003 R2 SP2 and Windows Server 2008 R2, the above snippet would run about 7 times faster on the Win2003R2 machine versus the Win2008R2 (3 seconds for Win2003R2 and 21 seconds for Win2008R2).
The following is the full version of the aforementioned test:
#include <windows.h>
#include <stdio.h>
#include <time.h>

HANDLE gSema4;
int gIterations = 1000000;

DWORD WINAPI testthread( LPVOID tn )
{
    int count = gIterations;
    while ( count-- )
    {
        WaitForSingleObject( gSema4, INFINITE );
        ReleaseSemaphore( gSema4, 1, NULL );
    }
    return 0;
}

int main( int argc, char* argv[] )
{
    DWORD threadId;
    clock_t ct;
    HANDLE threads[2];

    gSema4 = CreateSemaphore( NULL, 1, 1, NULL );

    ct = clock();
    threads[0] = CreateThread( NULL, 0, testthread, NULL, 0, &threadId );
    threads[1] = CreateThread( NULL, 0, testthread, NULL, 0, &threadId );
    WaitForMultipleObjects( 2, threads, TRUE, INFINITE );
    printf( "Total time = %ld\n", (long)(clock() - ct) );

    CloseHandle( threads[0] );
    CloseHandle( threads[1] );
    CloseHandle( gSema4 );
    return 0;
}
I updated the test so that the threads run in strict alternation: each thread performs a single iteration and then signals the other thread to run at the end of each loop (round-robin style). I also added a variant that uses a spinlock in place of the semaphore (which is a kernel object).
All machines I tested on were 64-bit machines. I compiled the test mostly as 32-bit. If built as 64-bit, it ran a bit faster overall and changed the ratios some, but the final result was the same. In addition to Win2008R2, I also ran against Windows 7 Enterprise SP 1, Windows Server 2003 R2 Standard SP 2, Windows Server 2008 (not R2), and Windows Server 2012 Standard.
Here are some actual numbers from the updated test (times are in milliseconds):
+----------------+-------------------+------------------+------------------+
| OS             | 2 CPUs, semaphore | 1 CPU, semaphore | 2 CPUs, spinlock |
+----------------+-------------------+------------------+------------------+
| Windows 7      | 7115 ms           | 1960 ms (3.6)    | 504 ms (14.1)    |
| Server 2008 R2 | 20640 ms          | 2263 ms (9.1)    | 866 ms (23.8)    |
| Server 2003    | 3570 ms           | 1766 ms (2.0)    | 452 ms (7.9)     |
+----------------+-------------------+------------------+------------------+
Each of the two threads in the test ran 1 million iterations. The tests were all run on identical machines: the Server 2008 R2 and Server 2003 numbers come from a dual-boot machine, and the Windows 7 machine had the exact same specs but was a different physical machine. In each case the machine was a Lenovo T420 laptop with a Core i5-2520M at 2.5 GHz. Obviously not server-class hardware, but I get similar results on true server-class machines. The numbers in parentheses are the ratio of the first column to the given column.
Any explanation for why this one OS would seem to introduce extra expense for kernel level synchronization across CPUs? Or do you know of some configuration/tuning parameter that might affect this?
While it would make this already long post even longer, I can post the enhanced version of the test code that the numbers above came from if anyone wants it. It shows the enforcement of the round-robin logic and the spinlock variant of the test.
Pulled from the comments into an answer:
Maybe the server is not set to the high-performance power plan? Win2k8 might have a different default. Many servers aren't by default, and this hits performance very hard.
The OP confirmed this as the root cause.
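For reference, the active power plan can be inspected and switched from an elevated command prompt. `SCHEME_MIN` is the built-in alias for the High performance plan; since custom-plan GUIDs vary per machine, list them first:

```shell
:: List the available power schemes; the active one is marked with *
powercfg /list

:: Switch to the built-in High performance plan
powercfg /setactive SCHEME_MIN
```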
This is a funny cause for this behavior; the idea popped into my head while I was doing something completely different.