I have some immutable data structures that I would like to manage using reference counts, sharing them across threads on an SMP system.
Here's what the release code looks like:
void avocado_release(struct avocado *p)
{
    if (atomic_dec(p->refcount) == 0) {
        free(p->pit);
        free(p->juicy_innards);
        free(p);
    }
}
Does atomic_dec need a memory barrier in it? If so, what kind of memory barrier?
Additional notes: The application must run on PowerPC and x86, so any processor-specific information is welcomed. I already know about the GCC atomic builtins. As for immutability, the refcount is the only field that changes over the duration of the object.
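For concreteness, here is a hedged sketch of the same release function written with the GCC atomic builtins mentioned above (this is illustrative, not the code from the question; it assumes refcount is a plain int member, and uses the conservative __ATOMIC_SEQ_CST ordering, since whether something weaker is safe is exactly what is being asked):

/* Sketch only: assumes struct avocado as above, with int refcount. */
void avocado_release(struct avocado *p)
{
    if (__atomic_sub_fetch(&p->refcount, 1, __ATOMIC_SEQ_CST) == 0) {
        free(p->pit);
        free(p->juicy_innards);
        free(p);
    }
}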
A general memory barrier gives a guarantee that all the LOAD and STORE operations specified before the barrier will appear to happen before all the LOAD and STORE operations specified after the barrier with respect to the other components of the system.
In computing, a memory barrier, also known as a membar, memory fence or fence instruction, is a type of barrier instruction that causes a central processing unit (CPU) or compiler to enforce an ordering constraint on memory operations issued before and after the barrier instruction.
Note that a barrier does not, in general, halt execution until every outstanding write has finished; rather, it constrains the order in which memory operations become visible to the rest of the system. Some stronger barrier variants do additionally wait for earlier writes to complete, which is how they can be used to ensure that the work of a critical section is globally visible before execution continues.
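As a purely illustrative example (using C11 atomics, which the question itself does not require), an explicit full fence between two stores on the publishing side looks like this:

#include <stdatomic.h>

int data;            /* ordinary shared data                      */
atomic_int ready;    /* flag that publishes data to other threads */

void publish(void)
{
    data = 42;                                  /* plain store                   */
    atomic_thread_fence(memory_order_seq_cst);  /* full fence: the store to data */
                                                /* is ordered before ready       */
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
}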
On x86, the atomic decrement will turn into a lock-prefixed assembly instruction, such as LOCK XADD. Being a single instruction, it is non-interruptible. As an added "feature", the lock prefix results in a full memory barrier:
"...locked operations serialize all outstanding load and store operations (that is, wait for them to complete)." ..."Locked operations are atomic with respect to all other memory operations and all externally visible events. Only instruction fetch and page table accesses can pass locked instructions. Locked instructions can be used to synchronize data written by one processor and read by another processor." - Intel® 64 and IA-32 Architectures Software Developer’s Manual, Chapter 8.1.2.
A memory barrier is in fact implemented as a dummy LOCK OR or LOCK AND in both the .NET and the Java JIT on x86/x64, because mfence is slower on many CPUs even when it's guaranteed to be available, as in 64-bit mode. (See "Does lock xchg have the same behavior as mfence?")
So you have a full fence on x86 as an added bonus, whether you like it or not. :-)
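To make that concrete, here is a hedged sketch (assuming GCC-style inline assembly on x86-64; the function name and constraints are mine, not from the original post) of an atomic decrement built directly on LOCK XADD:

/* Atomically adds -1 to *counter and returns the PREVIOUS value.
 * The LOCK prefix makes it atomic and also acts as a full fence. */
static inline int atomic_dec_fetch_prev(volatile int *counter)
{
    int delta = -1;
    __asm__ __volatile__(
        "lock xaddl %0, %1"
        : "+r"(delta), "+m"(*counter)
        :
        : "memory");
    return delta;   /* old value; the new value is delta - 1 */
}

A caller would free the object when the previous value was 1, i.e. when the count has just dropped to zero.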
On PPC, it is different. An LL/SC pair (lwarx & stwcx) with a subtraction inside can be used to load the memory operand into a register, subtract one, and then either write it back if there was no other store to the target location, or retry the whole loop if there was. An LL/SC sequence can be interrupted (meaning it will fail and retry), and it does not imply an automatic full fence. None of this compromises the atomicity of the counter in any way; it just means that in the x86 case you happen to get a fence as well, "for free". On PPC, one can insert a (partial or) full fence by emitting an (lw)sync instruction.
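A hedged sketch of that retry loop (assuming GCC inline assembly on 32-bit PowerPC; the register constraints and function name are mine, and there is deliberately no sync/lwsync, so this gives atomicity but no ordering):

/* Atomically decrements *counter and returns the NEW value. */
static inline int ppc_atomic_dec(volatile int *counter)
{
    int newval;
    __asm__ __volatile__(
        "1: lwarx   %0, 0, %2\n"   /* load and reserve *counter             */
        "   addi    %0, %0, -1\n"  /* subtract one                          */
        "   stwcx.  %0, 0, %2\n"   /* store conditionally; fails if the     */
                                   /* reservation was lost                  */
        "   bne-    1b\n"          /* lost the reservation? retry           */
        : "=&r"(newval), "+m"(*counter)
        : "r"(counter)
        : "cc", "memory");
    return newval;
}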
All in all, explicit memory barriers are not necessary for the atomic counter to work properly.
It is important to distinguish between atomic accesses (which guarantee that the read/modify/write of the value executes as one atomic unit) vs. memory reordering.
Memory barriers prevent reordering of reads and writes. Reordering is completely orthogonal to atomicity. For instance, on PowerPC, if you implement the most efficient atomic increment possible, it will not prevent reordering. If you want to prevent reordering then you need an lwsync or sync instruction, or some equivalent high-level (C++11) memory barrier.
Claims that there is "no possibility of the compiler reordering things in a problematic way" seem naive as general statements because compiler optimizations can be quite surprising and because CPUs (PowerPC/ARM/Alpha/MIPS in particular) aggressively reorder memory operations.
A coherent cache doesn't save you either. See https://preshing.com/archives/ to see how memory reordering really works.
In this case, however, I believe the answer is that no barriers are required. That is because for this specific case (reference counting) there is no need for a relationship between the reference count and the other values in the object. The one exception is when the reference count hits zero. At that point it is important to ensure that all updates from other threads are visible to the current thread, so a read-acquire barrier may be necessary.
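Put together, one common way to express that (a hedged sketch in C11 atomics, not code from the original post: the decrement uses release ordering, and an acquire fence is taken only on the path that actually drops the count to zero, before freeing):

#include <stdatomic.h>
#include <stdlib.h>

struct avocado {
    atomic_int refcount;
    void *pit;
    void *juicy_innards;
};

void avocado_release(struct avocado *p)
{
    /* fetch_sub returns the previous value, so 1 means we hit zero */
    if (atomic_fetch_sub_explicit(&p->refcount, 1, memory_order_release) == 1) {
        /* make all prior writes by other releasing threads visible here */
        atomic_thread_fence(memory_order_acquire);
        free(p->pit);
        free(p->juicy_innards);
        free(p);
    }
}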