pthread mutex vs atomic ops in Solaris

I was doing some tests with a simple program that measures the performance of an atomic increment on a 64-bit value, comparing atomic_add_64 with a mutex lock approach. What is puzzling me is that the atomic_add is slower than the mutex lock by a factor of 2.

EDIT!!! I've done some more testing. It looks like atomics are faster than the mutex and scale up to 8 concurrent threads. After that, the performance of the atomics degrades significantly.

The platform I've tested on is:

SunOS 5.10 Generic_141444-09 sun4u sparc SUNW,Sun-Fire-V490

CC: Sun C++ 5.9 SunOS_sparc Patch 124863-03 2008/03/12

The program is quite simple:

#include <stdio.h>
#include <stdint.h>
#include <pthread.h>
#include <atomic.h>

uint64_t        g_Loops = 1000000;
volatile uint64_t       g_Counter = 0;
volatile uint32_t       g_Threads = 20;

pthread_mutex_t g_Mutex;
pthread_mutex_t g_CondMutex;
pthread_cond_t  g_Condition;

void LockMutex() 
{ 
  pthread_mutex_lock(&g_Mutex); 
}

void UnlockMutex() 
{ 
   pthread_mutex_unlock(&g_Mutex); 
}

void InitCond()
{
   pthread_mutex_init(&g_CondMutex, 0);
   pthread_cond_init(&g_Condition, 0);
}

// Decrement the live-thread count and wake main() up.
void SignalThreadEnded()
{
   pthread_mutex_lock(&g_CondMutex);
   --g_Threads;
   pthread_cond_signal(&g_Condition);
   pthread_mutex_unlock(&g_CondMutex);
}

// Increment the shared counter g_Loops times under the mutex.
void* ThreadFuncMutex(void* arg)
{
   uint64_t counter = g_Loops;
   while(counter--)
   {
      LockMutex();
      ++g_Counter;
      UnlockMutex();
   }
   SignalThreadEnded();
   return 0;
}

// Increment the shared counter g_Loops times with atomic_add_64.
void* ThreadFuncAtomic(void* arg)
{
   uint64_t counter = g_Loops;
   while(counter--)
   {
      atomic_add_64(&g_Counter, 1);
   }
   SignalThreadEnded();
   return 0;
}


int main(int argc, char** argv)
{
   pthread_mutex_init(&g_Mutex, 0);
   InitCond();
   bool bMutexRun = true;
   if(argc > 1)
   {
      bMutexRun = false;
      printf("Atomic run!\n");
   }
   else
        printf("Mutex run!\n");

   // start threads
   uint32_t threads = g_Threads;
   while(threads--)
   {
      pthread_t thr;
      if(bMutexRun)
         pthread_create(&thr, 0,ThreadFuncMutex, 0);
      else
         pthread_create(&thr, 0,ThreadFuncAtomic, 0);
   }
   // Wait until every worker has signalled completion.
   pthread_mutex_lock(&g_CondMutex);
   while(g_Threads)
   {
      pthread_cond_wait(&g_Condition, &g_CondMutex);
      printf("Threads to go %u\n", g_Threads);
   }
   printf("DONE! g_Counter=%ld\n", (long)g_Counter);
}

The results of a test run on our box are:

$ CC -o atomictest atomictest.C
$ time ./atomictest
Mutex run!
Threads to go 19
...
Threads to go 0
DONE! g_Counter=20000000

real    0m15.684s
user    0m52.748s
sys     0m0.396s

$ time ./atomictest 1
Atomic run!
Threads to go 19
...
Threads to go 0
DONE! g_Counter=20000000

real    0m24.442s
user    3m14.496s
sys     0m0.068s

Did you run into this type of performance difference on Solaris? Any ideas why this happens?

On Linux, the same code (using the gcc __sync_fetch_and_add) yields a 5-fold performance improvement over the mutex version.
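
For reference, the Linux build only swaps the increment call; the atomic thread function there looks roughly like this (a sketch, reusing the same globals and SignalThreadEnded() as above):

// Linux version of ThreadFuncAtomic, using the gcc builtin instead of atomic_add_64.
void* ThreadFuncAtomicGcc(void* arg)
{
   uint64_t counter = g_Loops;
   while(counter--)
   {
      __sync_fetch_and_add(&g_Counter, 1);
   }
   SignalThreadEnded();
   return 0;
}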

Thanks, Octav

1 Answer

You have to be careful about what is happening here.

  1. It takes significant time to create a thread. Thus, it's likely that not all the threads are executing simultaneously. As evidence, I took your code, removed the mutex lock, and got the correct answer every time I ran it. This means that none of the threads were executing at the same time! You should not count the time to create/destroy threads in your test. You should wait until all threads are created and running before you start timing (see the barrier sketch after this list).

  2. Your test isn't fair. It has artificially high lock contention. For whatever reason, the atomic add_and_fetch suffers in that situation. In real life you would do some work in the thread; once you add even a little bit of work, the atomic ops perform a lot better, because the chance of contention drops significantly. When there is no contention, the atomic op has lower overhead than the mutex.

  3. Number of threads. The fewer threads running, the lower the contention; this is why fewer threads do better for the atomic version in this test. Your figure of 8 threads might be the number of threads your system can run simultaneously, or it might not, because your test is so skewed towards contention. I would expect your test to scale up to the number of simultaneously runnable threads and then plateau. One thing I cannot figure out is why, when the number of threads exceeds what the system can run simultaneously, we don't see evidence of the mutex being left locked while its holder sleeps. Maybe we do and I just can't see it happening.
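
To illustrate point 1, one way to keep thread creation out of the timed region is a POSIX barrier (a sketch only, an alternative to the spin-on-g_fGo gate my code below uses; it is not part of the measured program):

#include <stdint.h>
#include <pthread.h>

// Sketch: a start barrier so thread creation is not part of the timing.
pthread_barrier_t g_StartBarrier;   // initialized in main() with the worker count

void* TimedWorker(void* arg)
{
   // Block until every worker has been created and has reached this point.
   pthread_barrier_wait(&g_StartBarrier);

   // ...the timed loop (mutex or atomic increments) goes here...
   return 0;
}

// In main(), before creating the workers:
//    pthread_barrier_init(&g_StartBarrier, NULL, g_Threads);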

Bottom line: atomics are a lot faster in most real-life situations. They are not very good when you have to hold a lock for a long time... something you should avoid anyway (well, in my opinion at least!).
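
As a small illustration of that last point (hypothetical helpers, not from the benchmark): a lone counter can be a single atomic op, but as soon as two pieces of state must change together you are back to holding a lock, and then how long you hold it is what matters:

#include <stdint.h>
#include <pthread.h>

// Hypothetical globals, just to illustrate when a mutex is still needed.
volatile uint64_t g_Hits = 0;
uint64_t g_Sum = 0;
uint64_t g_SampleCount = 0;
pthread_mutex_t g_StatsMutex = PTHREAD_MUTEX_INITIALIZER;

void RecordHit(void)
{
   // A lone counter: one atomic op, no lock required.
   __sync_add_and_fetch(&g_Hits, 1);
}

void RecordSample(uint64_t value)
{
   // Two pieces of state that must stay consistent: take the mutex,
   // but keep the critical section as short as possible.
   pthread_mutex_lock(&g_StatsMutex);
   g_Sum += value;
   g_SampleCount += 1;
   pthread_mutex_unlock(&g_StatsMutex);
}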

I changed your code so you can test with no work, barely any work, and a little more work, as well as change the number of threads.

6sm = 6 threads, barely any work, mutex
6s  = 6 threads, barely any work, atomic

Use a capital S to get more work, and no s/S to get no work.

These results show that with 10 threads, the amount of work affects how much faster atomics are. In the first case there is no work and the atomics are barely faster. Add a little work and the gap grows to about 6 seconds; with more work it approaches 10 seconds.

(2) /dev_tools/Users/c698174/temp/atomic 
[c698174@shldvgfas007] $ t=10; a.out $t ; a.out "$t"m
ATOMIC FAST g_Counter=10000000 13.6520 s
MUTEX  FAST g_Counter=10000000 15.2760 s

(2) /dev_tools/Users/c698174/temp/atomic 
[c698174@shldvgfas007] $ t=10s; a.out $t ; a.out "$t"m
ATOMIC slow g_Counter=10000000 11.4957 s
MUTEX  slow g_Counter=10000000 17.9419 s

(2) /dev_tools/Users/c698174/temp/atomic 
[c698174@shldvgfas007] $ t=10S; a.out $t ; a.out "$t"m
ATOMIC SLOW g_Counter=10000000 14.7108 s
MUTEX  SLOW g_Counter=10000000 23.8762 s

With 20 threads, atomics are still better, but by a smaller margin. With no work they are almost the same speed; with a lot of work, atomics take the lead again.

(2) /dev_tools/Users/c698174/temp/atomic 
[c698174@shldvgfas007] $ t=20; a.out $t ; a.out "$t"m
ATOMIC FAST g_Counter=20000000 27.6267 s
MUTEX  FAST g_Counter=20000000 30.5569 s

(2) /dev_tools/Users/c698174/temp/atomic 
[c698174@shldvgfas007] $ t=20S; a.out $t ; a.out "$t"m
ATOMIC SLOW g_Counter=20000000 35.3514 s
MUTEX  SLOW g_Counter=20000000 48.7594 s

2 threads. Atomics dominate.

(2) /dev_tools/Users/c698174/temp/atomic 
[c698174@shldvgfas007] $ t=2S; a.out $t ; a.out "$t"m
ATOMIC SLOW g_Counter=2000000 0.6007 s
MUTEX  SLOW g_Counter=2000000 1.4966 s

Here is the code (Red Hat Linux, using the gcc atomic builtins):

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <pthread.h>

volatile uint64_t __attribute__((aligned (64))) g_Loops = 1000000 ;
volatile uint64_t __attribute__((aligned (64))) g_Counter = 0;
volatile uint32_t __attribute__((aligned (64))) g_Threads = 7; 
volatile uint32_t __attribute__((aligned (64))) g_Active = 0;
volatile uint32_t __attribute__((aligned (64))) g_fGo = 0;
int g_fSlow = 0;

#define true 1
#define false 0
#define NANOSEC(t) (1000000000ULL * (t).tv_sec + (t).tv_nsec)

pthread_mutex_t g_Mutex;
pthread_mutex_t g_CondMutex;
pthread_cond_t  g_Condition;

void LockMutex() 
{ 
  pthread_mutex_lock(&g_Mutex); 
}

void UnlockMutex() 
{ 
   pthread_mutex_unlock(&g_Mutex); 
}

// Announce this thread is ready, spin until main() sets g_fGo, then take
// the per-thread CPU-time start stamp.
void Start(struct timespec *pT)
{
   __sync_add_and_fetch(&g_Active, 1);
   while(!g_fGo) {}
   clock_gettime(CLOCK_THREAD_CPUTIME_ID, pT);
}

// Mark this thread as finished, take the end stamp, and return the elapsed
// per-thread CPU time in nanoseconds.
uint64_t End(struct timespec *pT)
{
   struct timespec T;
   __sync_sub_and_fetch(&g_Active, 1);
   clock_gettime(CLOCK_THREAD_CPUTIME_ID, &T);
   return NANOSEC(T) - NANOSEC(*pT);
}
// A small amount of floating-point busywork, called only in 'S' (SLOW) mode,
// to lower contention on the shared counter.
void Work(double *x, double z)
{
      *x += z;
      *x /= 27.6;
      if ((uint64_t)(*x + .5) - (uint64_t)*x != 0)
        *x += .7;
}
void* ThreadFuncMutex(void* arg)
{
   struct timespec T;
   uint64_t counter = g_Loops;
   double x = 0, z = 0;
   int fSlow = g_fSlow;

   Start(&T);
   if (!fSlow) {
     while(counter--) {
        LockMutex();
        ++g_Counter;
        UnlockMutex();
     }
   } else {
     while(counter--) {
        if (fSlow==2) Work(&x, z);
        LockMutex();
        ++g_Counter;
        z = g_Counter;
        UnlockMutex();
     }
   }
   *(uint64_t*)arg = End(&T);
   return (void*)(intptr_t)x;   // return x so the busywork is not optimized away
}

void* ThreadFuncAtomic(void* arg)
{
   struct timespec T;
   uint64_t counter = g_Loops;
   double x = 0, z = 0;
   int fSlow = g_fSlow;

   Start(&T);
   if (!fSlow) {
     while(counter--) {
        __sync_add_and_fetch(&g_Counter, 1);
     }
   } else {
     while(counter--) {
        if (fSlow==2) Work(&x, z);
        z = __sync_add_and_fetch(&g_Counter, 1);
     }
   }
   *(uint64_t*)arg = End(&T);
   return (void*)(intptr_t)x;   // return x so the busywork is not optimized away
}


int main(int argc, char** argv)
{
   int i;
   int bMutexRun;
   pthread_t thr[1000];
   uint64_t aT[1000];

   if (argc < 2) {
      fprintf(stderr, "usage: %s <nthreads>[s|S][m]\n", argv[0]);
      return 1;
   }
   bMutexRun = strchr(argv[1], 'm') != NULL;
   g_Threads = atoi(argv[1]);
   g_fSlow = (strchr(argv[1], 's') != NULL) ? 1 : ((strchr(argv[1], 'S') != NULL) ? 2 : 0);

   // start threads
   pthread_mutex_init(&g_Mutex, 0);
   for (i=0 ; i<g_Threads ; ++i)
         pthread_create(&thr[i], 0, (bMutexRun) ? ThreadFuncMutex : ThreadFuncAtomic, &aT[i]);

   // Wait until every worker is spinning in Start(), release them all at
   // once, then wait until the last one has called End().
   while (g_Active != g_Threads) {}
   g_fGo = 1;
   while (g_Active != 0) {}

   uint64_t nTot = 0;
   for (i=0 ; i<g_Threads ; ++i)
   { 
        pthread_join(thr[i], NULL);
        nTot += aT[i];
   }
   // done 
   printf("%s %s g_Counter=%llu %2.4lf s\n", (bMutexRun) ? "MUTEX " : "ATOMIC", 
    (g_fSlow == 2) ? "SLOW" : ((g_fSlow == 1) ? "slow" : "FAST"), g_Counter, (double)nTot/1e9);
}
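
If you want to build and try it yourself, something along these lines should work on Linux (the exact file name and flags are my assumption, not taken from the session above):

gcc -O2 -pthread -o a.out atomictest.c -lrt    # -lrt for clock_gettime on older glibc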