 

boost lockfree spsc_queue cache memory access

I need to be extremely concerned with speed/latency in my current multi-threaded project.

Cache access is something I'm trying to understand better. And I'm not clear on how lock-free queues (such as the boost::lockfree::spsc_queue) access/use memory on a cache level.

I've seen queues used where the pointer of a large object that needs to be operated on by the consumer core is pushed into the queue.

If the consumer core pops an element from the queue, I presume that means the element (a pointer in this case) is already loaded into the consumer core's L2 and L1 cache. But to access the data, does it not need to dereference the pointer, finding and loading the pointed-to object either from the L3 cache or across the interconnect (if the other thread is on a different CPU socket)? If so, would it maybe be better to simply send a copy of the object that could be disposed of by the consumer?

Thank you.

asked Jan 09 '23 by PaulD

1 Answer

C++ is principally a pay-for-what-you-need ecosystem.

Any regular queue will let you choose the storage semantics (by value or by reference).

However, this time you ordered something special: you ordered a lock-free queue. In order to be lock-free, it must be able to perform all the observable modifying operations as atomic operations. This naturally restricts the types that can be used in these operations directly.

You might doubt whether it's even possible for such a queue to hold value types that exceed the system's native register size (say, int64_t).

Good question.

Enter Ringbuffers

Indeed, any node-based container would just require pointer swaps for all modifying operations, which is trivially made atomic on all modern architectures. But does anything that involves copying multiple distinct memory areas in a non-atomic sequence really pose an unsolvable problem?

No. Imagine a flat array of POD data items. Now, if you treat the array as a circular buffer, one would just have to maintain the indices of the buffer's front and end positions atomically. The container could, at leisure, advance an internal 'dirty front index' while it copies ahead of the external front. (The copy can use relaxed memory ordering.) Only once the whole copy is known to have completed is the external front index updated. This update needs to be in acq_rel/cst memory order[1].
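A minimal sketch of that scheme (not Boost's actual code; the modulo indexing and the release/acquire orderings are simplifying assumptions, cf. [1]):

    #include <atomic>
    #include <cstddef>

    // Single-producer/single-consumer ring buffer sketch. Full/empty
    // detection deliberately sacrifices one slot to tell the two apart.
    template <typename T, std::size_t Capacity>
    class SpscRing {
        T buffer_[Capacity];
        std::atomic<std::size_t> write_index_{0}; // only advanced by the producer
        std::atomic<std::size_t> read_index_{0};  // only advanced by the consumer

    public:
        bool push(const T& value) {
            const std::size_t w    = write_index_.load(std::memory_order_relaxed);
            const std::size_t next = (w + 1) % Capacity;
            if (next == read_index_.load(std::memory_order_acquire))
                return false; // full: the front must never wrap onto unread data
            buffer_[w] = value;                                  // the 'dirty' copy
            write_index_.store(next, std::memory_order_release); // publish it
            return true;
        }

        bool pop(T& value) {
            const std::size_t r = read_index_.load(std::memory_order_relaxed);
            if (r == write_index_.load(std::memory_order_acquire))
                return false; // empty
            value = buffer_[r];
            read_index_.store((r + 1) % Capacity, std::memory_order_release);
            return true;
        }
    };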

As long as the container is able to guard the invariant that the front never fully wraps around and catches up with the back, this is a sweet deal. I think this idea was popularized by the Disruptor library (of LMAX fame). You get mechanical sympathy from

  • linear memory access patterns while reading/writing
  • even better if you can make the record size align with (a multiple of) physical cache lines (see the sketch after this list)
  • all the data staying local, unless the POD contains raw references outside that record
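To make the alignment point concrete, a sketch (the 64-byte line size and the record's fields are assumptions for illustration):

    #include <cstdint>

    // Pad/align each record to one 64-byte cache line so a record never
    // straddles two lines and neighbouring records never share one.
    struct alignas(64) Record {
        std::uint64_t sequence;   // illustrative fields only
        std::uint64_t timestamp;
        double        price;
        std::uint32_t quantity;
        // the compiler pads the remainder of the 64 bytes
    };

    static_assert(sizeof(Record) == 64, "one record per cache line");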

How Does Boost's spsc_queue Actually Do This?

  1. Yes, spsc_queue stores the raw element values in a contiguous, aligned block of memory (e.g. from compile_time_sized_ringbuffer, which underlies spsc_queue when the maximum capacity is supplied at compile time):

    typedef typename boost::aligned_storage<max_size * sizeof(T),
                                            boost::alignment_of<T>::value
                                           >::type storage_type;
    
    storage_type storage_;
    
    T * data()
    {
        return static_cast<T*>(storage_.address());
    }
    

    (The element type T need not even be POD, but it needs to be both default-constructible and copyable).

  2. Yes, the read and write indices are atomic integral values. Note that the Boost devs have taken care to apply enough padding to avoid false sharing on the cache line holding the reading/writing indices (from ringbuffer_base):

    static const int padding_size = BOOST_LOCKFREE_CACHELINE_BYTES - sizeof(size_t);
    atomic<size_t> write_index_;
    char padding1[padding_size]; /* force read_index and write_index to different cache lines */
    atomic<size_t> read_index_;
    
  3. In fact, as you can see, there is only the one "internal" index on either the read or the write side. This is possible because there is only one writing thread and only one reading thread, which means that a stale view of the other side's index can only under-report the available space, never over-report it.

  4. Several other optimizations are present:

    • branch prediction hints for platforms that support it (unlikely())
    • it's possible to push/pop a range of elements at once. This can improve throughput when you need to siphon from one buffer/ringbuffer into another, especially if the raw element size is not equal to (a whole multiple of) a cache line (see the sketch after this list)
    • use of std::uninitialized_copy where possible
    • the calls to trivial constructors/destructors are optimized out at instantiation time
    • the uninitialized_copy will be optimized into memcpy on all major standard library implementations (meaning that e.g. SSE instructions will be employed if your architecture supports them)
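The range interface mentioned above might be used like this (a sketch; the char element type and the buffer sizes are arbitrary assumptions):

    #include <boost/lockfree/spsc_queue.hpp>
    #include <cstddef>

    // Siphoning bytes through the queue with the range overloads.
    boost::lockfree::spsc_queue<char, boost::lockfree::capacity<4096>> bytes;

    void producer_side(const char* data, std::size_t n) {
        // push(T const*, size_type) copies as many elements as currently fit
        // and returns the number actually enqueued
        std::size_t pushed = 0;
        while (pushed < n)
            pushed += bytes.push(data + pushed, n - pushed);
    }

    std::size_t consumer_side(char* out, std::size_t max) {
        // pop(T*, size_type) dequeues up to max elements in one call
        return bytes.pop(out, max);
    }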

All in all, we see a best-in-class implementation of the ringbuffer idea.
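To tie it together, a minimal producer/consumer round-trip (a sketch; the Tick type and the capacity of 1024 are arbitrary assumptions):

    #include <boost/lockfree/spsc_queue.hpp>
    #include <cstdio>
    #include <thread>

    struct Tick { int id; double value; }; // small, trivially copyable element

    int main() {
        // statically sized: backed by compile_time_sized_ringbuffer
        boost::lockfree::spsc_queue<Tick, boost::lockfree::capacity<1024>> queue;

        std::thread producer([&] {
            for (int i = 0; i < 100; ++i)
                while (!queue.push(Tick{i, i * 0.5})) { /* spin: queue full */ }
        });

        std::thread consumer([&] {
            Tick t;
            for (int received = 0; received < 100;)
                if (queue.pop(t)) { std::printf("%d %f\n", t.id, t.value); ++received; }
        });

        producer.join();
        consumer.join();
    }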

What To Use

Boost has given you all the options. You can elect to make your element type a pointer to your message type. However, as you already raised in your question, this level of indirection reduces locality of reference and might not be optimal.

On the other hand, storing the complete message type in the element type could become expensive if copying is expensive. At the very least try to make the element type fit nicely into a cache line (typically 64 bytes on Intel).

So in practice you might consider storing the frequently used data right there in the value, and referencing the less frequently used data via a pointer (the cost of the pointer is low unless it's traversed).

If you need that "attachment" model, consider using a custom allocator for the referred-to data so you can achieve favorable memory access patterns there too.
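A sketch of that hybrid layout (the field names, the Payload type and the 64-byte line size are illustrative assumptions):

    #include <cstdint>

    struct Payload; // large, rarely-inspected "attachment"

    // Hot fields live inline so the consumer usually touches only this one
    // cache line; the cold attachment is reached through the pointer only
    // when actually needed.
    struct alignas(64) Message {
        std::uint64_t sequence;   // hot: read on every message
        std::uint64_t timestamp;  // hot
        std::uint32_t type;       // hot
        Payload*      attachment; // cold: allocate from a custom pool/arena
    };

    static_assert(sizeof(Message) == 64, "keep the hot part within one cache line");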

Let your profiler guide you.


[1] I suppose for SPSC, acq_rel should work, but I'm a bit rusty on the details. As a rule, I make it a point not to write lock-free code myself. I recommend everyone else follow my example :)

answered Jan 22 '23 by sehe