Let's imagine that I have a few worker threads such as the following:
while (1) {
    do_something();

    if (flag_isset())
        do_something_else();
}
We have a couple of helper functions for checking and setting a flag:
void flag_set() { global_flag = 1; }
void flag_clear() { global_flag = 0; }
int flag_isset() { return global_flag; }
Thus the threads keep calling do_something() in a busy loop, and whenever some other thread sets global_flag, each thread also calls do_something_else() (which could, for example, output progress or debugging information when requested by setting the flag from another thread).
My question is: Do I need to do something special to synchronize access to the global_flag? If yes, what exactly is the minimum work to do the synchronization in a portable way?
I have tried to figure this out by reading many articles but I am still not quite sure of the correct answer... I think it is one of the following:
A. We just need to define the flag as volatile to make sure that it is really read from shared memory every time it is checked:
volatile int global_flag;
It might not propagate to other CPU cores immediately but will sooner or later, guaranteed.
B. Setting the shared flag in one CPU core does not necessarily make it visible to another core. We need to use a mutex to make sure that flag changes are always propagated, by invalidating the corresponding cache lines on other CPUs. The code becomes as follows:
volatile int global_flag;
pthread_mutex_t flag_mutex = PTHREAD_MUTEX_INITIALIZER;

void flag_set()   { pthread_mutex_lock(&flag_mutex); global_flag = 1; pthread_mutex_unlock(&flag_mutex); }
void flag_clear() { pthread_mutex_lock(&flag_mutex); global_flag = 0; pthread_mutex_unlock(&flag_mutex); }

int flag_isset()
{
    int rc;
    pthread_mutex_lock(&flag_mutex);
    rc = global_flag;
    pthread_mutex_unlock(&flag_mutex);
    return rc;
}
C. This is the same as B, but instead of using a mutex on both sides (reader & writer) we take it only on the writing side, because the logic does not require full synchronization; we just need to synchronize (invalidate other caches) when the flag is changed:
volatile int global_flag;
pthread_mutex_t flag_mutex = PTHREAD_MUTEX_INITIALIZER;

void flag_set()   { pthread_mutex_lock(&flag_mutex); global_flag = 1; pthread_mutex_unlock(&flag_mutex); }
void flag_clear() { pthread_mutex_lock(&flag_mutex); global_flag = 0; pthread_mutex_unlock(&flag_mutex); }

int flag_isset() { return global_flag; }
This would avoid continuously locking and unlocking the mutex when we know that the flag is rarely changed. We are just using a side-effect of Pthreads mutexes to make sure that the change is propagated.
I think A and B are the obvious choices, B being safer. But how about C?
If C is ok, is there some other way of forcing the flag change to be visible on all CPUs?
There is one somewhat related question: Does guarding a variable with a pthread mutex guarantee it's also not cached? ...but it does not really answer this.
A mutex ensures synchronization among threads working on shared resources. A mutex is initialized and then a lock is taken by calling two functions: pthread_mutex_init initializes the mutex, and pthread_mutex_lock then protects any critical region in the code (see the sketch below).
You don't need a mutex for atomicity as such: it's no problem if another thread writes to the variable while you're reading it, because a read or write of an aligned int is atomic unless you're working on a very unusual CPU. But if you want correct behavior, the change must also become visible to the other threads, and that is not guaranteed to happen promptly or in order unless you use a mutex or another form of memory barrier.
Thread synchronization is required whenever two threads share a resource or need to be aware of what the other threads in a process are doing. Mutexes are the simplest, most primitive objects used for the cooperative mutual exclusion required to share and protect resources: a mutex (mutual exclusion object) lets all threads use the same resource, but only one thread may hold it at a time. This lock-based technique is the standard way to handle the critical-section problem.
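For reference, a minimal sketch of that pattern with pthreads (shared_counter and the helper names are illustrative, not from the question):

#include <pthread.h>

/* Illustrative shared state. */
static int shared_counter = 0;
static pthread_mutex_t counter_mutex;

void counter_init(void)
{
    /* First function: initialize the mutex once, before any thread uses it. */
    pthread_mutex_init(&counter_mutex, NULL);
}

void counter_increment(void)
{
    /* Second function: lock the critical region so that only one
       thread mutates the shared state at a time. */
    pthread_mutex_lock(&counter_mutex);
    shared_counter++;
    pthread_mutex_unlock(&counter_mutex);
}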
The 'minimum amount of work' is an explicit memory barrier. The syntax depends on your compiler; on GCC you could do:
void flag_set() {
    global_flag = 1;
    __sync_synchronize();
}

void flag_clear() {
    global_flag = 0;
    __sync_synchronize();
}

int flag_isset() {
    int val;
    // Prevent the read from migrating backwards
    __sync_synchronize();
    val = global_flag;
    // and prevent it from being propagated forwards as well
    __sync_synchronize();
    return val;
}
These memory barriers accomplish two important goals:
They force a compiler flush. Consider a loop like the following:
for (int i = 0; i < 1000000000; i++) {
flag_set(); // assume this is inlined
local_counter += i;
}
Without a barrier, a compiler might choose to optimize this to:
for (int i = 0; i < 1000000000; i++) {
local_counter += i;
}
flag_set();
Inserting a barrier forces the compiler to write the variable back immediately.
They force the CPU to order its writes and reads. This is not so much an issue with a single flag: most CPU architectures will eventually see a flag that's set even without CPU-level barriers. However, the order might change. If we have two flags, then on thread A:
// start with only flag A set
flag_set_B();
flag_clear_A();
And on thread B:
a = flag_isset_A();
b = flag_isset_B();
assert(a || b); // can be false!
Some CPU architectures allow these writes to be reordered; you may see both flags come back false (i.e., the clearing of flag A became visible before the setting of flag B). This can be a problem if a flag protects, say, a pointer being valid. Memory barriers force an ordering on writes to protect against these problems.
Note also that on some CPUs, it's possible to use 'acquire-release' barrier semantics to further reduce overhead. Such a distinction does not exist on x86, however, and would require inline assembly on GCC.
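For what it's worth, C11 later made acquire-release ordering available portably through <stdatomic.h>; a minimal sketch, reusing the question's helper names:

#include <stdatomic.h>

atomic_int global_flag;

void flag_set(void)
{
    /* Release store: all writes before it become visible to a thread
       that subsequently acquire-loads the flag. */
    atomic_store_explicit(&global_flag, 1, memory_order_release);
}

int flag_isset(void)
{
    /* Acquire load: no later reads can be reordered before it. */
    return atomic_load_explicit(&global_flag, memory_order_acquire);
}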
A good overview of what memory barriers are and why they are needed can be found in the Linux kernel documentation directory. Finally, note that this code is enough for a single flag, but if you want to synchronize against any other values as well, you must tread very carefully. A lock is usually the simplest way to do things.
You must not cause a data race. A data race is undefined behavior, and the compiler is allowed to do anything and everything it pleases.
A humorous blog on the topic: http://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong
Case A: There is no synchronization on the flag, so anything is allowed to happen. For example, the compiler is allowed to turn
flag_set();
while (weArentBoredLoopingYet())
    doSomethingVeryExpensive();
flag_clear();

into

while (weArentBoredLoopingYet())
    doSomethingVeryExpensive();
flag_set();
flag_clear();
Note: this kind of race is actually very popular. Your mileage may vary. On the one hand, the de facto implementation of pthread_once involves a data race like this. On the other hand, it is undefined behavior. On most versions of gcc, you can get away with it because gcc chooses not to exercise its right to optimize this way in many cases, but it is not "spec" code.
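For reference, the once-initialization API mentioned above is used like this (a minimal sketch; do_init is an illustrative initializer):

#include <pthread.h>
#include <stdio.h>

static pthread_once_t init_once = PTHREAD_ONCE_INIT;

/* Illustrative one-time initializer. */
static void do_init(void)
{
    printf("initialized exactly once\n");
}

void thread_entry(void)
{
    /* Safe to call concurrently from many threads;
       do_init() runs exactly once. */
    pthread_once(&init_once, do_init);
}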
Case B: Full synchronization is the right call. This is simply what you have to do.
Case C: Synchronization only on the writer could work, if you can prove that no one wants to read the flag while it is being written. The official definition of a data race (from the C++11 specification) is one thread writing to a variable while another thread can concurrently read or write the same variable. If your readers and writers all run at once, you still have a data race. However, if you can prove that the writer writes once, there is some synchronization, and then the readers all read, then the readers do not need synchronization.
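A minimal sketch of that safe write-once pattern, relying on the POSIX guarantee that pthread_create() synchronizes memory between the creating thread and the new thread (config_flag and reader are illustrative names):

#include <pthread.h>

static int config_flag;  /* written once, before any reader starts */

static void *reader(void *arg)
{
    (void)arg;
    /* No lock needed: the write happened before this thread was
       created, and pthread_create() synchronizes memory. */
    return config_flag ? (void *)1 : (void *)0;
}

int main(void)
{
    pthread_t t;
    config_flag = 1;  /* the single write; no readers exist yet */
    pthread_create(&t, NULL, reader, NULL);
    pthread_join(t, NULL);
    return 0;
}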
As for caching, the rule is that a mutex lock/unlock synchronizes with all threads that lock/unlock the same mutex. This means you will not see any unusual caching effects (although under the hood, your processor can do spectacular things to make this run faster... it's just obliged to make it look like it wasn't doing anything special). If you don't synchronize, however, you get no guarantees that the other thread doesn't have changes to push that you need!
All of that being said, the question is really how much you are willing to rely on compiler-specific behavior. If you want to write proper code, you need to do proper synchronization. If you are willing to rely on the compiler to be kind to you, you can get away with a lot less.
If you have C++11, the easy answer is to use an atomic such as std::atomic<int> (std::atomic_flag exists too, but it cannot be read without also modifying it until C++20), which is designed to do exactly what you want AND is designed to synchronize correctly for you in most cases.
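C11 offers the same facility in <stdatomic.h>; a minimal sketch with the default sequentially consistent ordering:

#include <stdatomic.h>

atomic_int global_flag;

void flag_set(void)   { atomic_store(&global_flag, 1); }
void flag_clear(void) { atomic_store(&global_flag, 0); }
int  flag_isset(void) { return atomic_load(&global_flag); }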