How to make a group of statements atomic without memory visibility effects?

synchronized blocks let me make a group of statements atomic, while ensuring there is a happens-before relationship between block exit and enter.

I read that the biggest cost of synchronization is the memory visibility guarantees, not the lock contention.
Let's say I am able to guarantee memory visibility via other means:

How do I make a group of statements atomic without creating a happens-before relationship, that is, without the memory visibility effects of synchronized/Lock?

I tried to implement a lock in user space via CAS, but the built-in one severely outperforms it, and the CAS on the lock variable still emits memory barriers.


In this example, a mutex without memory visibility effects would suffice.

int x; // variable with release/acquire semantics

synchronized (this) {
    // acquire fence (monitor enter)

    int y = x;
    // acquire fence (load of x)

    // release fence (store to x)
    x = 5;

    // release fence (monitor exit)
}

The same set of fences is emitted twice (once by the mutex and once by x). Does this cause unnecessary overhead?


Is a lock without memory effects theoretically possible?
Would a lock without memory effects actually be more performant?
Is there a built-in way to accomplish this in C++ and/or Java?
If there isn't, can it be implemented in C++ and/or Java?

asked May 07 '20 by Atom 12



3 Answers

The cost of guaranteeing memory visibility in a mutex is negligible; in fact, on x86 it is free.

Acquiring a mutex requires an atomic read-modify-write operation with acquire semantics. For releasing a mutex, a simple store with release semantics is sufficient. Consider a simple spin-lock: the acquire operation consists of a loop that repeatedly tries to set a lock flag to 1 if it is currently 0. To release the lock, the owning thread simply writes 0 to the lock flag. In many regards such a simple spin-lock is far from optimal, and there are many lock designs that try to improve on it (e.g., fairness, spinning on local cache lines, etc.), but in all these designs releasing the lock is certainly cheaper than acquiring it.
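
A minimal C++ sketch of such a spin-lock (my own illustration of the acquire-exchange / release-store pattern just described, not code from the question):

#include <atomic>
#include <thread>

class SpinLock {
    std::atomic<bool> locked{false};
public:
    void lock() {
        // Atomic read-modify-write with acquire semantics: loop until this
        // thread is the one that flips the flag from false to true.
        while (locked.exchange(true, std::memory_order_acquire)) {
            // Spin on a plain load to avoid hammering the cache line with RMWs.
            while (locked.load(std::memory_order_relaxed)) {
                std::this_thread::yield();
            }
        }
    }

    void unlock() {
        // A plain store with release semantics is enough to publish everything
        // written inside the critical section to the next lock owner.
        locked.store(false, std::memory_order_release);
    }
};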

The x86 memory model is pretty strong: all atomic read-modify-write operations are sequentially consistent, all store operations effectively have release semantics, and all load operations have acquire semantics. That's why on x86 releasing a mutex can be done with a normal store; no additional instructions are required to ensure visibility of memory effects. On architectures with weaker memory models, like ARM or Power, you do need additional instructions, but the cost is negligible compared to the cost of the acquire operation. x86 also has special barrier instructions, but those are usually only relevant in certain lock-free programming scenarios, and the cost of these instructions is about the same as an atomic read-modify-write.

The real cost of a mutex is not the visibility of memory effects, but contention and the serialization of the execution. If the number of threads competing for the mutex is low, and the duration for which a thread holds the mutex is also low, then the overall performance impact will also be low. But if the number of threads fighting for the mutex is large, and the duration for which a thread holds the mutex is also large, then other threads will have to wait longer until they can finally acquire the mutex and continue execution. This reduces the work that can be performed within a given time frame.

I am not sure what you mean by "Is a lock without memory effects theoretically possible?". The whole purpose of a mutex is to allow some operations to be performed - and also observed - as if they were atomic. This implies that the effect of the operation becomes visible to the next owner of the mutex. This is actually what the happens-before relation guarantees. If a thread A acquires the mutex, and this acquire operation happens-after a release operation by some thread B, then due to the transitivity of the happens-before relation, the operations performed by B while holding the mutex must have happened before the operations A is about to perform - and that means all memory effects have to be visible. If this is not guaranteed, then your mutex is broken and you have a race condition.

Regarding the volatile variable in your example - the Java memory model requires that all operations on shared volatile variables are sequentially consistent. However, if x is only ever accessed inside a critical section (i.e., protected by some mutex), then it does not have to be volatile. Volatile is only needed if some threads access the variable without any other synchronization mechanisms like a mutex.
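
To illustrate that last point, here is a minimal C++ sketch (my own example, not code from the question): a variable that is only ever accessed while holding the lock can be a plain int, because the lock's acquire/release semantics already order all accesses to it.

#include <mutex>

std::mutex m;
int x = 0; // plain int: no atomic or volatile needed, since m guards every access

void writer() {
    std::lock_guard<std::mutex> guard(m);
    x = 5; // published to the next lock holder by the unlock's release semantics
}

int reader() {
    std::lock_guard<std::mutex> guard(m);
    return x; // sees the latest value written under the lock
}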

The release/acquire semantics of the mutex operations are necessary to order the operations inside the mutex. In C++ one could implement a mutex using relaxed operations. The lock/unlock operations on the mutex itself would still be totally ordered (due to the modification order of the mutex), but we would lose the happens-before relation, so the operations inside the mutex would be unordered. While this would be possible in C++, it would be rather absurd, because, as I tried to explain, making the memory effects visible is very cheap (on x86 it is free), while you would lose a property that is absolutely crucial in virtually all cases. Note: the store operation to release the mutex is cheaper than a store to a volatile variable. Volatile variables are sequentially consistent, but releasing a mutex can be done with a release-store. (Of course, the Java memory model is not as flexible as the C++ model, so you cannot really implement a hand-knitted lock using more relaxed acquire/release operations.)
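
To make that concrete, a hypothetical sketch of such a relaxed "mutex" in C++ (my illustration, not the answer's code). The lock operations are still atomic and totally ordered, yet the data inside the critical section is unordered across threads, which in C++ is a data race:

#include <atomic>

class RelaxedLock { // deliberately broken: for illustration only
    std::atomic<bool> locked{false};
public:
    void lock() {
        // Relaxed RMW: atomic and totally ordered on `locked`, but it
        // establishes no happens-before with the previous unlock().
        while (locked.exchange(true, std::memory_order_relaxed)) { /* spin */ }
    }
    void unlock() {
        locked.store(false, std::memory_order_relaxed);
    }
};

RelaxedLock lk;
int data = 0;

void threadA() { lk.lock(); data = 42; lk.unlock(); }
void threadB() { lk.lock(); int d = data; lk.unlock(); }
// threadB may not observe data == 42: the critical sections never overlap,
// but without acquire/release the accesses to data are unordered --
// undefined behavior (a data race).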

answered Nov 03 '22 by mpoeter

A busy-waiting lock without a happens-before relationship between exit and enter can be implemented like this in Java:

import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

class NoFenceLock {
    private static final VarHandle STATE;
    static {
        try {
            STATE = MethodHandles.lookup()
                    .findVarHandle(NoFenceLock.class, "state", boolean.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    private boolean state;

    void lock() {
        // Atomic swap with release (but not acquire) semantics: taking the
        // lock does not synchronize-with the previous unlock().
        while ((boolean) STATE.getAndSetRelease(this, true)) {
            while (state) { // plain read; may return a stale value
                Thread.onSpinWait();
            }
        }
    }

    void unlock() {
        STATE.setOpaque(this, false); // opaque store: eventually visible, no ordering
    }
}

In the above code, Thread.onSpinWait() on x86 prevents state from being cached indefinitely. On architectures where that is not the case, the following could be used instead:

while ((boolean) STATE.getOpaque(this)) {}
answered Nov 03 '22 by Atom 12

I was asking exactly the same question a while ago.

I solved it for my particular use case with a simple piece of code. Other use cases will have different optimum solutions.


Example Use Case 1: Hot loop that needs to check an atomic

Assuming your use case is something like this: a hot loop has to spin as fast as possible, and it can't afford to be continuously checking an atomic (or volatile) variable as this involves synchronising state between two processor cores.

The solution is remarkably simple: only check every 1024 iterations. Why 1024? It's a power of 2, so any MOD operators get optimised to a fast bitwise AND calculation. Any other power of 2 will do. Tune to suit.

After that, the overhead of an atomic becomes negligible compared to the work the loop is performing.

It would be possible to implement other more complicated solutions. But these lines are sufficient:

// Within a hot loop running on a single core ...
// (stop is assumed to be a std::atomic<bool> written by another thread)
for (int32_t i = 0; ; ++i) {
    if (i % 1024 == 0) // Optimised to a bitwise AND (checks if the lower 10 bits are 0).
    {
        // Check atomics, which may force the processor to synchronise state with another core.
        if (stop.load(std::memory_order_relaxed)) break;
    }
    // ... hot work ...
}

Example Use Cases 2..N: The Realtime Bible

There is an excellent talk on various levels of locking for other use cases; see Real-time 101 - David Rowland & Fabian Renn-Giles - Meeting C++ 2019.


Q. Is a lock without memory effects theoretically possible?

  • A. Yes. If two threads are pinned to the same core, they could share state using registers. If two threads run on different cores that sit close together on the die, they could share state using the L1, L2 or L3 caches. Otherwise, threads on different cores usually communicate shared state via a flush towards RAM.

Q. Would a lock without memory effects actually be more performant?

  • A. Yes. However, this is really only applicable to embedded operating systems or device drivers; see the update below.

Q. Is there a built-in way to accomplish this in C++ and/or Java?

  • A. No. But the answers above can get very close, often by avoiding locks altogether most of the time. Better to spend the time mastering a good profiler.

Q. If there isn't, can it be implemented in C++ and/or Java?

  • A. No. Not realistic in a high-level language; see the update below.

Hardware interrupts are entirely distinct from software interrupts. A hardware interrupt causes the processor's Instruction Pointer (IP) to switch to executing another service routine. So a lock without memory effects is theoretically possible if there are many "threads" running on a single core, and the "threads" (i.e. the Interrupt Service Routines triggered by hardware interrupts) communicate via registers in the CPU, or at the very least via an internal cache (L1, L2, L3), as this does not end up hitting RAM.

On a practical level, this is probably not relevant to any high-level language such as C++ or Java, and it is probably not relevant to user-mode processes in high-level operating systems such as Linux or Windows. It is probably only possible when using an embedded OS such as QNX, or perhaps when writing kernel-mode device drivers for Windows or Linux.

So in practice, a reasonable rule of thumb is to just assume that all locks have memory effects. If one is concerned about performance, run a profiler. If there are performance issues, choose from the selection of threading architectures in said Realtime Bible.

answered Nov 03 '22 by Contango