A readers/writer lock... without having a lock for the readers?

Question

I get the feeling this may be a very general and common situation for which a well-known no-lock solution exists.

In a nutshell, I'm hoping there's an approach like a readers/writer lock that doesn't require the readers to acquire a lock and can thus have better average performance.

Instead there'd be some atomic operations (128-bit CAS) for a reader, and a mutex for a writer. I'd have two copies of the data structure: a read-only one for the normally-successful queries, and an identical copy to be updated under mutex protection. Once the data has been inserted into the writable copy, we make it the new readable copy. The old readable copy then gets the same insert in turn: the writer spins on the number of readers still using it until it's zero, then modifies it, and finally releases the mutex.
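A minimal sketch of the scheme described above, under several assumptions that are not in the question: the class name `TwoCopyMap`, the `std::map` payload, and a per-copy reader counter (a plain atomic increment rather than a 128-bit CAS) are all illustrative. Readers re-check the active index after registering so that a writer that swapped in between is never raced, and everything uses the default sequentially-consistent ordering for simplicity.

```cpp
#include <atomic>
#include <cassert>
#include <map>
#include <mutex>
#include <string>
#include <thread>

// Hypothetical two-copy container: readers never take the mutex.
class TwoCopyMap {
public:
    TwoCopyMap() : active_(0) {
        readers_[0].store(0);
        readers_[1].store(0);
    }

    bool lookup(const std::string& key, int& out) const {
        for (;;) {
            int idx = active_.load();
            readers_[idx].fetch_add(1);          // register on that copy
            if (active_.load() == idx) {         // still the readable copy?
                auto it = copies_[idx].find(key);
                bool found = (it != copies_[idx].end());
                if (found) out = it->second;
                readers_[idx].fetch_sub(1);
                return found;
            }
            readers_[idx].fetch_sub(1);          // a writer swapped; retry
        }
    }

    void insert(const std::string& key, int value) {
        std::lock_guard<std::mutex> lock(writer_mutex_);
        int old = active_.load();
        int fresh = 1 - old;
        copies_[fresh][key] = value;             // update the hidden copy
        active_.store(fresh);                    // publish it to readers
        while (readers_[old].load() != 0)        // wait for old readers
            std::this_thread::yield();
        copies_[old][key] = value;               // bring the old copy up to date
    }

private:
    std::map<std::string, int> copies_[2];
    std::atomic<int> active_;
    mutable std::atomic<int> readers_[2];
    std::mutex writer_mutex_;
};
```

Note that readers here still perform an atomic read-modify-write on a shared counter, so they contend with each other for that cache line even though they never block on the mutex.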

Or something like that.

Anything along these lines exist?

Peter Cordes · Accepted Answer

If your data fits in a 64-bit value, most systems can cheaply read/write that atomically, so just use std::atomic<my_struct>.
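As a rough illustration of that simplest case, here is a sketch with a hypothetical 8-byte struct (`Sample`, `read_sample` and `publish_sample` are made-up names). Whether the atomic is actually lock-free depends on the target, so real code may want to check `is_lock_free()`.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <type_traits>

// Hypothetical 8-byte payload; field names are illustrative only.
struct Sample {
    std::uint32_t value;
    std::uint32_t generation;
};

static_assert(sizeof(Sample) == 8, "payload should fit in 64 bits");
static_assert(std::is_trivially_copyable<Sample>::value,
              "std::atomic<T> requires a trivially copyable T");

std::atomic<Sample> g_sample{Sample{0, 0}};

// Reader: a single atomic load, no read-modify-write on shared state.
Sample read_sample() {
    return g_sample.load(std::memory_order_acquire);
}

// Writer: a single atomic store of the whole struct.
void publish_sample(Sample s) {
    g_sample.store(s, std::memory_order_release);
}
```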

For smallish and/or infrequently-written data, there are a couple ways to make readers truly read-only on the shared data, not having to do any atomic RMW operations on a shared counter or anything. This allows read-side scaling to many threads without readers contending with each other (unlike a 128-bit atomic read on x86 using lock cmpxchg16b¹, or taking a RWlock).

Ideally just an extra level of indirection via an atomic<T*> pointer (RCU), or just an extra load + compare-and-branch (SeqLock); no atomic RMWs or memory barriers stronger than acq/rel or anything else in the read side.

This can be appropriate for data that's read very frequently by many threads, e.g. a timestamp updated by a timer interrupt but read all over the place. Or a config setting that typically never changes.

If your data is larger and/or changes more frequently, one of the strategies suggested in other answers that requires a reader to still take a RWlock on something or atomically increment a counter will be more appropriate. This won't scale perfectly because each reader still needs to get exclusive ownership of the shared cache line containing lock or counter so it can modify it, but there's no such thing as a free lunch.
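For comparison, a conventional readers/writer lock might look like the sketch below. It assumes C++17 `std::shared_mutex`; the `LockedTable` class and its `std::map` payload are hypothetical. Every reader modifies the lock's shared state when taking the shared lock, which is the scaling cost described above.

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <shared_mutex>
#include <string>

// Hypothetical table guarded by a readers/writer lock.
class LockedTable {
public:
    bool lookup(const std::string& key, int& out) const {
        // Shared lock: many readers may hold it at once, but each one
        // still writes to the lock's own state to acquire and release it.
        std::shared_lock<std::shared_mutex> lock(mutex_);
        auto it = table_.find(key);
        if (it == table_.end()) return false;
        out = it->second;
        return true;
    }

    void insert(const std::string& key, int value) {
        // Exclusive lock: blocks all readers and other writers.
        std::unique_lock<std::shared_mutex> lock(mutex_);
        table_[key] = value;
    }

private:
    mutable std::shared_mutex mutex_;
    std::map<std::string, int> table_;
};
```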

Note 1: Update: x86 vendors finally decided to guarantee that 128-bit SSE/AVX loads / stores are atomic on CPUs with AVX, so if you're lucky std::atomic<16-byte-struct> has cheap loads when running on a CPU with AVX enabled (so not, e.g., Pentium/Celeron before Ice Lake). GCC for a while has been indirecting to a libatomic atomic_load_16 function for 16-byte operations, so runtime dispatching for it can pick a lock cmpxchg16b strategy on CPUs that support it. Now it has a much better option to choose from on some CPUs.

RCU

It sounds like you're half-way to inventing RCU (Read Copy Update) where you update a pointer to the new version.

But remember a lock-free reader might stall after loading the pointer, so you have a deallocation problem. This is the hard part of RCU. In a kernel it can be solved by having sync points where you know that there are no readers older than some time t, and thus can free old versions. There are some user-space implementations. See https://en.wikipedia.org/wiki/Read-copy-update and https://lwn.net/Articles/262464/.
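The pointer-swap part can be sketched as below. This is not a real RCU implementation: the `RcuCell` class, the `Config` payload and the `reclaim()` function are hypothetical, and the hard part (knowing when old versions are safe to free) is pushed onto the caller, who is assumed to call `reclaim()` only at a point where no reader can still hold a pointer from before the last update.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical read-mostly payload.
struct Config {
    int verbosity;
    std::string name;
};

class RcuCell {
public:
    explicit RcuCell(Config initial)
        : current_(new Config(std::move(initial))) {}
    ~RcuCell() { delete current_.load(); }

    // Reader: one acquire load, no RMW and no lock. The pointer is only
    // valid until the next reclaim() (an assumption of this sketch).
    const Config* read() const {
        return current_.load(std::memory_order_acquire);
    }

    // Writer: copy the current version, modify the copy, publish it.
    template <class F>
    void update(F&& modify) {
        std::lock_guard<std::mutex> lock(writer_mutex_);
        const Config* old = current_.load(std::memory_order_relaxed);
        auto fresh = std::make_unique<Config>(*old);
        modify(*fresh);
        current_.store(fresh.release(), std::memory_order_release);
        retired_.emplace_back(old);  // cannot free yet: readers may use it
    }

    // Frees retired versions. ASSUMPTION: the caller guarantees that no
    // reader still holds a pointer obtained before the last update.
    std::size_t reclaim() {
        std::lock_guard<std::mutex> lock(writer_mutex_);
        std::size_t n = retired_.size();
        retired_.clear();
        return n;
    }

private:
    std::atomic<const Config*> current_;
    std::mutex writer_mutex_;
    std::vector<std::unique_ptr<const Config>> retired_;
};
```

A reader that loaded the pointer before an update keeps seeing a consistent old version until reclamation, which is the property that makes the read side so cheap.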

For RCU, the less frequent the changes, the larger a data structure you can justify copying. e.g. even a moderate-sized tree could be doable if it's only ever changed interactively by an admin, while readers are running on dozens of cores all checking something in parallel. e.g. kernel config settings are one thing where RCU is great in Linux.

SeqLock

If your data is small (e.g. a 64-bit timestamp on a 32-bit machine), another good option is a SeqLock. Readers check a sequence counter before/after a non-atomic copy of the data into a private buffer. If the sequence counters match, we know there wasn't tearing. (Writers mutually exclude each other with a separate mutex.) See also the related questions "Implementing 64 bit atomic counter with 32 bit atomics" and "how to implement a seqlock lock using c++11 atomic library".

It's a bit of a hack in C++ to write something that can compile efficiently to a non-atomic copy that might have tearing, because inevitably that's data-race UB. (Unless you use std::atomic<long> with mo_relaxed for each chunk separately, but then you're preventing the compiler from using movdqu or something to copy 16 bytes at once.)
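A sketch of the UB-free variant mentioned in the parenthetical, with each 64-bit chunk as its own relaxed atomic. The `SeqLock128` class and `Pair128` struct are hypothetical names, and the fence placement is one common arrangement rather than the only valid one.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>

// Hypothetical 128-bit payload split into two 64-bit chunks.
struct Pair128 {
    std::uint64_t lo;
    std::uint64_t hi;
};

class SeqLock128 {
public:
    void write(Pair128 v) {
        std::lock_guard<std::mutex> lock(writer_mutex_);  // writers exclude each other
        unsigned s = seq_.load(std::memory_order_relaxed);
        seq_.store(s + 1, std::memory_order_relaxed);     // odd: write in progress
        std::atomic_thread_fence(std::memory_order_release);
        lo_.store(v.lo, std::memory_order_relaxed);
        hi_.store(v.hi, std::memory_order_relaxed);
        seq_.store(s + 2, std::memory_order_release);     // even: write complete
    }

    Pair128 read() const {
        for (;;) {
            unsigned s0 = seq_.load(std::memory_order_acquire);
            if (s0 & 1u) continue;                        // writer active, retry
            Pair128 v{lo_.load(std::memory_order_relaxed),
                      hi_.load(std::memory_order_relaxed)};
            std::atomic_thread_fence(std::memory_order_acquire);
            unsigned s1 = seq_.load(std::memory_order_relaxed);
            if (s0 == s1) return v;                       // no tearing observed
        }
    }

private:
    std::atomic<unsigned> seq_{0};
    std::atomic<std::uint64_t> lo_{0};
    std::atomic<std::uint64_t> hi_{0};
    std::mutex writer_mutex_;
};
```

The reader only performs loads, so many readers can share the cache line without contending. With a single writer, the mutex could be dropped.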

A SeqLock makes the reader copy the whole thing (or ideally just load it into registers) every read so it's only ever appropriate for a small struct or 128-bit integer or something. But for less than 64 bytes of data it can be quite good, better than having readers use lock cmpxchg16b for a 128-bit datum if you have many readers and infrequent writes.

It's not lock-free, though: a writer that sleeps while modifying the SeqLock could get readers stuck retrying indefinitely. For a small SeqLock the window is small, and obviously you want to have all the data ready before you do the first sequence-counter update to minimize the chance for an interrupt pausing the writer in mid update.

The best case is when there's only 1 writer so it doesn't have to do any locking; it knows nothing else will be modifying the sequence counter.


Tags: c++, concurrency, stdatomic, lock-free, lockless

Swiss Frank
