Does anyone know of a good/correct implementation of Peterson's Lock algorithm in C? I can't seem to find this. Thanks.
Peterson's algorithm synchronizes two processes. It uses two shared variables: a bool array flag of size 2 and an int variable turn. In a producer/consumer setting, i would index the consumer and j the producer. Initially both flags are false.
Peterson's solution is a classic solution to the critical section problem: ensuring that no two processes modify a shared resource at the same time. For example, let int a = 5 and let two processes p1 and p2 both modify the value of a; without mutual exclusion their updates can interleave and one update can be lost, as the sketch below shows.
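A minimal sketch of that race (the names a, p1 and p2 come from the example above; the pthread harness is my own illustration, not part of any lock):

#include <pthread.h>
#include <stdio.h>

int a = 5;   /* shared resource */

void *increment(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++)
        a = a + 1;   /* read-modify-write: not atomic, so updates can be lost */
    return NULL;
}

int main(void)
{
    pthread_t p1, p2;
    pthread_create(&p1, NULL, increment, NULL);
    pthread_create(&p2, NULL, increment, NULL);
    pthread_join(p1, NULL);
    pthread_join(p2, NULL);
    printf("a = %d (expected 200005, often less)\n", a);
    return 0;
}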
However, in Peterson's solution, a deadlock can never happen, because the process that sets the turn variable first is guaranteed to enter the critical section (the second writer overwrites turn in the other's favor). So even if a process is preempted right after setting turn in the entry section, it will still get into the critical section on its next chance.
You are right: Peterson's algorithm for two threads is fair (a.k.a. first come, first served).
Peterson's algorithm cannot be implemented correctly in C99, as explained in "Who ordered memory fences on an x86?".
Peterson's algorithm is as follows:
          Thread "id"                     Thread "other"
LOCK:     interested[id] = 1              interested[other] = 1
          turn = other                    turn = id
          while turn == other             while turn == id
            and interested[other] == 1      and interested[id] == 1

UNLOCK:   interested[id] = 0              interested[other] = 0
There are some hidden assumptions here. To begin with, each thread must note its interest in acquiring the lock before giving away its turn. Giving away the turn must make our interest in acquiring the lock visible to the other thread.
Also, as in every lock, memory accesses in the critical section cannot be hoisted past the lock() call, nor sunk past the unlock(). I.e.: lock() must have at least acquire semantics, and unlock() must have at least release semantics.
In C11, the simplest way to achieve this would be to use a sequentially consistent memory order, which makes the code run as if it were a simple interleaving of threads running in program order (WARNING: totally untested code, but it's similar to an example in Dmitriy V'jukov's Relacy Race Detector):
#include <stdatomic.h>

static atomic_int interested[2];   /* initially 0 */
static atomic_int turn;

void lock(int id)
{
    atomic_store(&interested[id], 1);
    atomic_store(&turn, 1 - id);
    /* spin while it's the other thread's turn and it is interested */
    while (atomic_load(&turn) == 1 - id
           && atomic_load(&interested[1 - id]) == 1)
        ;
}

void unlock(int id)
{
    atomic_store(&interested[id], 0);
}
This ensures that the compiler doesn't make optimizations that break the algorithm (by hoisting/sinking loads/stores across atomic operations), and emits the appropriate CPU instructions so that the CPU doesn't break it either. The default memory order for C11/C++11 atomic operations that don't explicitly select one is memory_order_seq_cst, the sequentially consistent memory order.
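A quick way to exercise it (an untested sketch, in the same spirit as the warnings above; the pthread harness and the counter are mine, not part of the algorithm):

#include <pthread.h>
#include <stdio.h>

static long counter;   /* protected by the Peterson lock above */

static void *worker(void *arg)
{
    int id = (int)(long)arg;   /* thread index: 0 or 1 */
    for (int i = 0; i < 100000; i++) {
        lock(id);
        counter++;             /* critical section */
        unlock(id);
    }
    return NULL;
}

int main(void)
{
    pthread_t t0, t1;
    pthread_create(&t0, NULL, worker, (void *)0L);
    pthread_create(&t1, NULL, worker, (void *)1L);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("counter = %ld (expect 200000)\n", counter);
    return 0;
}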
C11/C++11 also support weaker memory models, allowing as much optimization as possible. The following is a translation to C11 of the translation to C++11 by Anthony Williams of an algorithm originally by Dmitriy V'jukov in the syntax of his own Relacy Race Detector [petersons_lock_with_C++0x_atomics] [the-inscrutable-c-memory-model]. If this algorithm is incorrect it's my fault (WARNING: also untested code, but based on good code from Dmitriy V'jukov and Anthony Williams):
#include <stdatomic.h>

static atomic_int interested[2];   /* initially 0 */
static atomic_int turn;

void lock(int id)
{
    atomic_store_explicit(&interested[id], 1, memory_order_relaxed);
    atomic_exchange_explicit(&turn, 1 - id, memory_order_acq_rel);
    while (atomic_load_explicit(&interested[1 - id], memory_order_acquire) == 1
           && atomic_load_explicit(&turn, memory_order_relaxed) == 1 - id)
        ;
}

void unlock(int id)
{
    atomic_store_explicit(&interested[id], 0, memory_order_release);
}
Notice the exchange with acquire and release semantics. An exchange is an atomic RMW operation. Atomic RMW operations always read the last value stored before the write in the RMW operation. Also, an acquire on an atomic object that reads a write from a release on that same atomic object (or any later write on that object from the thread that performed the release or any later write from any atomic RMW operation) creates a synchronizes-with relation between the release and the acquire.
So this operation is a synchronization point between the threads: there is always a synchronizes-with relationship between the exchange in one thread and the last exchange performed by any thread (or the initialization of turn, for the very first exchange).
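To make "synchronizes-with" concrete, here is a minimal message-passing sketch (my own illustration, not part of the lock; data and ready are made-up names): the acquire load that sees the release store guarantees that everything sequenced before the store is visible after the load.

#include <stdatomic.h>

static int data;            /* plain, non-atomic */
static atomic_int ready;    /* initially 0 */

void producer(void)
{
    data = 42;                                                /* A */
    atomic_store_explicit(&ready, 1, memory_order_release);   /* B */
}

void consumer(void)
{
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)  /* C */
        ;
    /* B synchronizes-with C, so A happens-before this read: data == 42 */
    int r = data;
    (void)r;
}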
So we have a sequenced-before relationship between the store to interested[id] and the exchange from/to turn, a synchronizes-with relationship between two consecutive exchanges from/to turn, and a sequenced-before relationship between the exchange from/to turn and the load of interested[1 - id]. This amounts to a happens-before relationship between accesses to interested[x] in different threads, with turn providing the synchronization between threads. This forces all the ordering needed to make the algorithm work.
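Mapped onto the lock() above (same code as before, with the ordering edges as comments):

void lock(int id)
{
    atomic_store_explicit(&interested[id], 1, memory_order_relaxed);
    /* sequenced-before: the store above comes before the exchange below */
    atomic_exchange_explicit(&turn, 1 - id, memory_order_acq_rel);
    /* synchronizes-with: this exchange synchronizes with the previous
       exchange on turn performed by the other thread */
    while (atomic_load_explicit(&interested[1 - id], memory_order_acquire) == 1
           /* sequenced-before: the exchange comes before this load */
           && atomic_load_explicit(&turn, memory_order_relaxed) == 1 - id)
        ;
}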
So how were these things done before C11? It involved compiler- and CPU-specific magic. As an example, let's look at the pretty strongly-ordered x86. IIRC, all x86 loads have acquire semantics, and all stores have release semantics (save non-temporal moves, in SSE, used precisely to achieve higher performance at the cost of occasionally needing to issue CPU fences to achieve coherence between CPUs). But this is not enough for Peterson's algorithm. As Bartosz Milewski explains in who-ordered-memory-fences-on-an-x86, for Peterson's algorithm to work we need to establish an ordering between the accesses to turn and interested; failing to do that may result in seeing loads from interested[1 - id] before writes to interested[id], which is a bad thing.
So a way to do that with GCC on x86 would be (WARNING: although I tested something similar to the following, a modified version of the code at wrong-implementation-of-petersons-algorithm, testing is nowhere near enough to assure the correctness of multithreaded code):
static int interested[2];   /* deliberately not volatile: the compiler
                               barrier below forces the reloads */
static int turn;

void lock(int id)
{
    interested[id] = 1;
    turn = 1 - id;
    __asm__ __volatile__("mfence");   /* full CPU fence, see below */
    do {
        __asm__ __volatile__("" ::: "memory");   /* compiler barrier */
    } while (turn == 1 - id
             && interested[1 - id] == 1);
}

void unlock(int id)
{
    interested[id] = 0;
}
The MFENCE prevents stores and loads to different memory addresses from being reordered. Otherwise the write to interested[id] could be queued in the store buffer while the load of interested[1 - id] proceeds. On many current microarchitectures an SFENCE may be enough, since it may be implemented as a store buffer drain, but IIUC SFENCE doesn't need to be implemented that way, and may simply prevent reordering between stores. So SFENCE may not be enough everywhere, and we need a full MFENCE.
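This is the classic store-buffer litmus test; a small sketch makes the failure mode concrete (x and y are illustrative stand-ins for the two interested flags):

static int x, y;      /* initially 0 */
static int r0, r1;

void thread0(void) { x = 1; /* mfence would go here */ r0 = y; }
void thread1(void) { y = 1; /* mfence would go here */ r1 = x; }

/* Without the fences, r0 == 0 && r1 == 0 is an allowed outcome on x86:
   each store can still sit in its CPU's store buffer when the other
   CPU performs its load. With MFENCE in both threads, that outcome
   is forbidden. */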
The compiler barrier (__asm__ __volatile__("":::"memory")) prevents the compiler from deciding that it already knows the value of turn. We're telling the compiler that we've clobbered memory, so all values cached in registers must be reloaded from memory.
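Without it, since neither variable is volatile here, the optimizer is entitled to hoist the loads out of the loop, roughly as if we had written (illustrative fragment, broken on purpose):

/* What the compiler may legally produce WITHOUT the barrier: */
int t = turn;                  /* loaded once...          */
int i = interested[1 - id];    /* ...and never reloaded   */
while (t == 1 - id && i == 1)
    ;   /* potential infinite loop on stale values */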
P.S: I feel this needs a closing paragraph, but my brain is drained.
I won't make any assertions about how good or correct the implementation is, but it was tested (briefly). This is a straight translation of the algorithm described on Wikipedia.
#include <assert.h>

struct petersonslock_t {
    volatile unsigned flag[2];
    volatile unsigned turn;
};
typedef struct petersonslock_t petersonslock_t;

petersonslock_t petersonslock () {
    petersonslock_t l = { { 0U, 0U }, ~0U };
    return l;
}

void petersonslock_lock (petersonslock_t *l, int p) {
    assert(p == 0 || p == 1);
    l->flag[p] = 1;
    l->turn = !p;
    while (l->flag[!p] && (l->turn == !p)) {}
}

void petersonslock_unlock (petersonslock_t *l, int p) {
    assert(p == 0 || p == 1);
    l->flag[p] = 0;
}
Greg points out that on an SMP architecture with slightly relaxed memory coherency (such as x86), although accesses to the same memory location are kept in order, a store to one location followed by a load from a different location may appear reordered to the other processor.
Jens Gustedt and ninjalj recommend modifying the original algorithm to use the atomic_flag type. This means setting the flags and turn would use atomic_flag_test_and_set and clearing them would use atomic_flag_clear from C11. Alternatively, a memory barrier could be imposed between updates to flag.
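Since C11's atomic_flag offers no plain load operation, here is a sketch of only the second alternative, the memory barrier, using GCC's __sync_synchronize() full barrier (my adaptation of petersonslock_lock above, untested):

void petersonslock_lock (petersonslock_t *l, int p) {
    assert(p == 0 || p == 1);
    l->flag[p] = 1;
    l->turn = !p;
    __sync_synchronize();   /* full barrier: publish flag/turn before
                               reading the other thread's flag; volatile
                               already keeps the loop's loads fresh */
    while (l->flag[!p] && (l->turn == !p)) {}
}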
Edit: I originally attempted to correct for this by writing all the state to the same memory location. ninjalj pointed out that the bitwise operations turn the state updates into RMW operations rather than the loads and stores of the original algorithm, so atomic bitwise operations are required. C11 provides such operators, as does GCC with built-ins. The algorithm below uses GCC built-ins, but wrapped in macros so that it can easily be switched to some other implementation. However, modifying the original algorithm above is the preferred solution.
struct petersonslock_t {
    volatile unsigned state;   /* turn in bits 0-7, flag for thread 0 in
                                  bits 8-15, flag for thread 1 in bits 16-23 */
};
typedef struct petersonslock_t petersonslock_t;

#define ATOMIC_OR(x,v)  __sync_or_and_fetch(&x, v)
#define ATOMIC_AND(x,v) __sync_and_and_fetch(&x, v)

petersonslock_t petersonslock () {
    petersonslock_t l = { 0x000000U };
    return l;
}

void petersonslock_lock (petersonslock_t *l, int p) {
    assert(p == 0 || p == 1);
    unsigned mask = (p == 0) ? 0xFF0000 : 0x00FF00;        /* other's flag */
    ATOMIC_OR(l->state, (p == 0) ? 0x000100 : 0x010000);   /* set our flag */
    /* give the turn to the other thread: turn byte = 1 for p0, 0 for p1 */
    (p == 0) ? ATOMIC_OR(l->state, 0x000001) : ATOMIC_AND(l->state, 0xFFFF00);
    while ((l->state & mask) && (l->state & 0x0000FF) == !p) {}
}

void petersonslock_unlock (petersonslock_t *l, int p) {
    assert(p == 0 || p == 1);
    ATOMIC_AND(l->state, (p == 0) ? 0xFF00FF : 0x00FFFF);  /* clear our flag */
}
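Usage looks the same for either variant, since they share the same API (my own sketch, untested):

#include <pthread.h>
#include <stdio.h>

static petersonslock_t lk;
static long counter;

static void *worker(void *arg)
{
    int p = (int)(long)arg;            /* thread index: 0 or 1 */
    for (int i = 0; i < 100000; i++) {
        petersonslock_lock(&lk, p);
        counter++;                     /* critical section */
        petersonslock_unlock(&lk, p);
    }
    return NULL;
}

int main(void)
{
    pthread_t t0, t1;
    lk = petersonslock();
    pthread_create(&t0, NULL, worker, (void *)0L);
    pthread_create(&t1, NULL, worker, (void *)1L);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("counter = %ld\n", counter);
    return 0;
}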