Does anyone know of a good/correct implementation of Peterson's Lock algorithm in C? I can't seem to find this. Thanks.
Peterson's algorithm synchronizes two processes. It uses two shared variables: a bool array flag of size 2 and an int variable turn. In a producer/consumer setting, i would index the consumer and j the producer. Initially both flags are false.
Peterson's solution is a classic solution to the critical section problem: ensuring that no two processes modify a shared resource at the same time. For example, let int a = 5 and let two processes p1 and p2 both modify the value of a; without mutual exclusion their updates can interleave and one update can be lost, as the sketch below shows.
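A minimal sketch of that race (the names a, p1 and p2 come from the example above; the pthread harness is my own illustration, not part of any lock):

#include <pthread.h>
#include <stdio.h>

int a = 5;   /* shared resource */

void *increment(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++)
        a = a + 1;   /* read-modify-write: not atomic, so updates can be lost */
    return NULL;
}

int main(void)
{
    pthread_t p1, p2;
    pthread_create(&p1, NULL, increment, NULL);
    pthread_create(&p2, NULL, increment, NULL);
    pthread_join(p1, NULL);
    pthread_join(p2, NULL);
    printf("a = %d (expected 200005, often less)\n", a);
    return 0;
}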
However, in Peterson's solution, a deadlock can never happen, because the process that sets the turn variable first is guaranteed to enter the critical section (the second writer overwrites turn in the other's favor). So even if a process is preempted right after setting turn in the entry section, it will still get into the critical section on its next chance.
You are right: Peterson's algorithm for two threads is fair (a.k.a. first come, first served).
Peterson's algorithm cannot be implemented correctly in C99, as explained in "Who ordered memory fences on an x86?".
Peterson's algorithm is as follows:
          Thread "id"                     Thread "other"
LOCK:     interested[id] = 1              interested[other] = 1
          turn = other                    turn = id
          while turn == other             while turn == id
            and interested[other] == 1      and interested[id] == 1

UNLOCK:   interested[id] = 0              interested[other] = 0
There are some hidden assumptions here. To begin with, each thread must note its interest in acquiring the lock before giving away its turn. Giving away the turn must make our interest in acquiring the lock visible to the other thread.
Also, as in every lock, memory accesses in the critical section cannot be hoisted past the lock() call, nor sunk past the unlock(). I.e.: lock() must have at least acquire semantics, and unlock() must have at least release semantics.
In C11, the simplest way to achieve this would be to use a sequentially consistent memory order, which makes the code run as if it were a simple interleaving of threads running in program order (WARNING: totally untested code, but it's similar to an example in Dmitriy V'jukov's Relacy Race Detector):
#include <stdatomic.h>

static atomic_int interested[2];   /* initially 0 */
static atomic_int turn;

void lock(int id)
{
    atomic_store(&interested[id], 1);
    atomic_store(&turn, 1 - id);
    /* spin while it's the other thread's turn and it is interested */
    while (atomic_load(&turn) == 1 - id
           && atomic_load(&interested[1 - id]) == 1)
        ;
}

void unlock(int id)
{
    atomic_store(&interested[id], 0);
}
This ensures that the compiler doesn't make optimizations that break the algorithm (by hoisting/sinking loads/stores across atomic operations), and emits the appropriate CPU instructions so that the CPU doesn't break it either. The default memory order for C11/C++11 atomic operations that don't explicitly select one is memory_order_seq_cst, the sequentially consistent memory order.
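A quick way to exercise it (an untested sketch, in the same spirit as the warnings above; the pthread harness and the counter are mine, not part of the algorithm):

#include <pthread.h>
#include <stdio.h>

static long counter;   /* protected by the Peterson lock above */

static void *worker(void *arg)
{
    int id = (int)(long)arg;   /* thread index: 0 or 1 */
    for (int i = 0; i < 100000; i++) {
        lock(id);
        counter++;             /* critical section */
        unlock(id);
    }
    return NULL;
}

int main(void)
{
    pthread_t t0, t1;
    pthread_create(&t0, NULL, worker, (void *)0L);
    pthread_create(&t1, NULL, worker, (void *)1L);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("counter = %ld (expect 200000)\n", counter);
    return 0;
}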
C11/C++11 also support weaker memory models, allowing as much optimization as possible. The following is a translation to C11 of the translation to C++11 by Anthony Williams of an algorithm originally by Dmitriy V'jukov in the syntax of his own Relacy Race Detector [petersons_lock_with_C++0x_atomics] [the-inscrutable-c-memory-model]. If this algorithm is incorrect it's my fault (WARNING: also untested code, but based on good code from Dmitriy V'jukov and Anthony Williams):
#include <stdatomic.h>

static atomic_int interested[2];   /* initially 0 */
static atomic_int turn;

void lock(int id)
{
    atomic_store_explicit(&interested[id], 1, memory_order_relaxed);
    atomic_exchange_explicit(&turn, 1 - id, memory_order_acq_rel);
    while (atomic_load_explicit(&interested[1 - id], memory_order_acquire) == 1
           && atomic_load_explicit(&turn, memory_order_relaxed) == 1 - id)
        ;
}

void unlock(int id)
{
    atomic_store_explicit(&interested[id], 0, memory_order_release);
}
Notice the exchange with acquire and release semantics. An exchange is an atomic RMW operation. Atomic RMW operations always read the last value stored before the write in the RMW operation. Also, an acquire on an atomic object that reads a write from a release on that same atomic object (or any later write on that object from the thread that performed the release or any later write from any atomic RMW operation) creates a synchronizes-with relation between the release and the acquire.
So this operation is a synchronization point between the threads: there is always a synchronizes-with relationship between the exchange in one thread and the last exchange performed by any thread (or the initialization of turn, for the very first exchange).
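To make "synchronizes-with" concrete, here is a minimal message-passing sketch (my own illustration, not part of the lock; data and ready are made-up names): the acquire load that sees the release store guarantees that everything sequenced before the store is visible after the load.

#include <stdatomic.h>

static int data;            /* plain, non-atomic */
static atomic_int ready;    /* initially 0 */

void producer(void)
{
    data = 42;                                                /* A */
    atomic_store_explicit(&ready, 1, memory_order_release);   /* B */
}

void consumer(void)
{
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)  /* C */
        ;
    /* B synchronizes-with C, so A happens-before this read: data == 42 */
    int r = data;
    (void)r;
}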
So we have a sequenced-before relationship between the store to interested[id] and the exchange from/to turn, a synchronizes-with relationship between two consecutive exchanges from/to turn, and a sequenced-before relationship between the exchange from/to turn and the load of interested[1 - id]. This amounts to a happens-before relationship between accesses to interested[x] in different threads, with turn providing the synchronization between threads. This forces all the ordering needed to make the algorithm work.
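Mapped onto the lock() above (same code as before, with the ordering edges as comments):

void lock(int id)
{
    atomic_store_explicit(&interested[id], 1, memory_order_relaxed);
    /* sequenced-before: the store above comes before the exchange below */
    atomic_exchange_explicit(&turn, 1 - id, memory_order_acq_rel);
    /* synchronizes-with: this exchange synchronizes with the previous
       exchange on turn performed by the other thread */
    while (atomic_load_explicit(&interested[1 - id], memory_order_acquire) == 1
           /* sequenced-before: the exchange comes before this load */
           && atomic_load_explicit(&turn, memory_order_relaxed) == 1 - id)
        ;
}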
So how were these things done before C11? It involved compiler- and CPU-specific magic. As an example, let's look at the pretty strongly-ordered x86. IIRC, all x86 loads have acquire semantics, and all stores have release semantics (save non-temporal moves, in SSE, used precisely to achieve higher performance at the cost of occasionally needing to issue CPU fences to achieve coherence between CPUs). But this is not enough for Peterson's algorithm. As Bartosz Milewski explains in who-ordered-memory-fences-on-an-x86, for Peterson's algorithm to work we need to establish an ordering between the accesses to turn and interested; failing to do that may result in seeing loads from interested[1 - id] before writes to interested[id], which is a bad thing.
So a way to do that with GCC on x86 would be (WARNING: although I tested something similar to the following, a modified version of the code at wrong-implementation-of-petersons-algorithm, testing is nowhere near enough to assure the correctness of multithreaded code):
static int interested[2];   /* deliberately not volatile: the compiler
                               barrier below forces the reloads */
static int turn;

void lock(int id)
{
    interested[id] = 1;
    turn = 1 - id;
    __asm__ __volatile__("mfence");   /* full CPU fence, see below */
    do {
        __asm__ __volatile__("" ::: "memory");   /* compiler barrier */
    } while (turn == 1 - id
             && interested[1 - id] == 1);
}

void unlock(int id)
{
    interested[id] = 0;
}
The MFENCE prevents stores and loads to different memory addresses from being reordered. Otherwise the write to interested[id] could be queued in the store buffer while the load of interested[1 - id] proceeds. On many current microarchitectures an SFENCE may be enough, since it may be implemented as a store buffer drain, but IIUC SFENCE doesn't need to be implemented that way, and may simply prevent reordering between stores. So SFENCE may not be enough everywhere, and we need a full MFENCE.
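This is the classic store-buffer litmus test; a small sketch makes the failure mode concrete (x and y are illustrative stand-ins for the two interested flags):

static int x, y;      /* initially 0 */
static int r0, r1;

void thread0(void) { x = 1; /* mfence would go here */ r0 = y; }
void thread1(void) { y = 1; /* mfence would go here */ r1 = x; }

/* Without the fences, r0 == 0 && r1 == 0 is an allowed outcome on x86:
   each store can still sit in its CPU's store buffer when the other
   CPU performs its load. With MFENCE in both threads, that outcome
   is forbidden. */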
The compiler barrier (__asm__ __volatile__("":::"memory")) prevents the compiler from deciding that it already knows the value of turn. We're telling the compiler that we've clobbered memory, so all values cached in registers must be reloaded from memory.
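Without it, since neither variable is volatile here, the optimizer is entitled to hoist the loads out of the loop, roughly as if we had written (illustrative fragment, broken on purpose):

/* What the compiler may legally produce WITHOUT the barrier: */
int t = turn;                  /* loaded once...          */
int i = interested[1 - id];    /* ...and never reloaded   */
while (t == 1 - id && i == 1)
    ;   /* potential infinite loop on stale values */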
P.S: I feel this needs a closing paragraph, but my brain is drained.
I won't make any assertions about how good or correct the implementation is, but it was tested (briefly). This is a straight translation of the algorithm described on Wikipedia.
#include <assert.h>

struct petersonslock_t {
    volatile unsigned flag[2];
    volatile unsigned turn;
};
typedef struct petersonslock_t petersonslock_t;

petersonslock_t petersonslock () {
    petersonslock_t l = { { 0U, 0U }, ~0U };
    return l;
}

void petersonslock_lock (petersonslock_t *l, int p) {
    assert(p == 0 || p == 1);
    l->flag[p] = 1;
    l->turn = !p;
    while (l->flag[!p] && (l->turn == !p)) {}
}

void petersonslock_unlock (petersonslock_t *l, int p) {
    assert(p == 0 || p == 1);
    l->flag[p] = 0;
}
Greg points out that on an SMP architecture with slightly relaxed memory coherency (such as x86), although accesses to the same memory location are kept in order, a store to one location followed by a load from a different location may appear reordered to the other processor.
Jens Gustedt and ninjalj recommend modifying the original algorithm to use the atomic_flag type. This means setting the flags and turn would use atomic_flag_test_and_set and clearing them would use atomic_flag_clear from C11. Alternatively, a memory barrier could be imposed between updates to flag.
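Since C11's atomic_flag offers no plain load operation, here is a sketch of only the second alternative, the memory barrier, using GCC's __sync_synchronize() full barrier (my adaptation of petersonslock_lock above, untested):

void petersonslock_lock (petersonslock_t *l, int p) {
    assert(p == 0 || p == 1);
    l->flag[p] = 1;
    l->turn = !p;
    __sync_synchronize();   /* full barrier: publish flag/turn before
                               reading the other thread's flag; volatile
                               already keeps the loop's loads fresh */
    while (l->flag[!p] && (l->turn == !p)) {}
}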
Edit: I originally attempted to correct for this by writing all the state to the same memory location. ninjalj pointed out that the bitwise operations turn the state updates into RMW operations rather than the loads and stores of the original algorithm, so atomic bitwise operations are required. C11 provides such operators, as does GCC with built-ins. The algorithm below uses GCC built-ins, but wrapped in macros so that it can easily be switched to some other implementation. However, modifying the original algorithm above is the preferred solution.
struct petersonslock_t {
    volatile unsigned state;   /* turn in bits 0-7, flag for thread 0 in
                                  bits 8-15, flag for thread 1 in bits 16-23 */
};
typedef struct petersonslock_t petersonslock_t;

#define ATOMIC_OR(x,v)  __sync_or_and_fetch(&x, v)
#define ATOMIC_AND(x,v) __sync_and_and_fetch(&x, v)

petersonslock_t petersonslock () {
    petersonslock_t l = { 0x000000U };
    return l;
}

void petersonslock_lock (petersonslock_t *l, int p) {
    assert(p == 0 || p == 1);
    unsigned mask = (p == 0) ? 0xFF0000 : 0x00FF00;        /* other's flag */
    ATOMIC_OR(l->state, (p == 0) ? 0x000100 : 0x010000);   /* set our flag */
    /* give the turn to the other thread: turn byte = 1 for p0, 0 for p1 */
    (p == 0) ? ATOMIC_OR(l->state, 0x000001) : ATOMIC_AND(l->state, 0xFFFF00);
    while ((l->state & mask) && (l->state & 0x0000FF) == !p) {}
}

void petersonslock_unlock (petersonslock_t *l, int p) {
    assert(p == 0 || p == 1);
    ATOMIC_AND(l->state, (p == 0) ? 0xFF00FF : 0x00FFFF);  /* clear our flag */
}
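Usage looks the same for either variant, since they share the same API (my own sketch, untested):

#include <pthread.h>
#include <stdio.h>

static petersonslock_t lk;
static long counter;

static void *worker(void *arg)
{
    int p = (int)(long)arg;            /* thread index: 0 or 1 */
    for (int i = 0; i < 100000; i++) {
        petersonslock_lock(&lk, p);
        counter++;                     /* critical section */
        petersonslock_unlock(&lk, p);
    }
    return NULL;
}

int main(void)
{
    pthread_t t0, t1;
    lk = petersonslock();
    pthread_create(&t0, NULL, worker, (void *)0L);
    pthread_create(&t1, NULL, worker, (void *)1L);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("counter = %ld\n", counter);
    return 0;
}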