I have a basic block-based allocator that will be used to allocate memory for POD-types in a multi-threaded setting, where all threads will be requesting or releasing memory simultaneously to a shared memory pool. My current implementation is detailed here:
#include <concurrent_queue.h>
#include <atomic>
template <typename T, size_t BlockSize> class BlockAlloc {
private:
// element_t is either the data for T or a pointer to the next free element_t
struct element_t {
::byte buffer[sizeof(T) > sizeof(element_t*) ? sizeof(T) : sizeof(element_t*)];
element_t*& next() {return reinterpret_cast<element_t*&>(buffer);};
T*& data() {return reinterpret_cast<T*&>(buffer);};
};
// block_t is a contiguous block of elements
struct block_t {
element_t elements[BlockSize];
};
// Allocate one new block of contiguous elements
void AllocBlock() {
block_t* block = new block_t();
// each element in the block points to the previous element...
for (int i = 1; i < BlockSize; i++)
block->elements[i].next() = &block->elements[i - 1];
// free should point to the final element in the block,
// while the first element of the block points to free's old location.
// Update made with compare-and-swap loop:
while (!free.compare_exchange_strong(block->elements[0].next() = free.load(), &block->elements[BlockSize-1])) {}
// add the block to the list of all allocated blocks
blocks.push(block);
};
// Release all memory held by all blocks
void ReleaseBlocks() {
free = nullptr;
block_t* block {nullptr};
while (blocks.try_pop(block)) delete block;
};
public:
BlockAlloc() : blocks {}, free(nullptr) { AllocBlock(); };
~BlockAlloc() { ReleaseBlocks(); };
// Acquire a new element from the free list and construct it.
template <typename... TArgs> _type_* Alloc(TArgs &&... a) {
_type_* data {nullptr};
element_t* element {nullptr};
// Grab the free element and set free to the next item in that list
// Update made with compare-and-swap loop:
while (true) {
if (element = free.load()) {
// CRASHES HERE DURING THE CMPxSWP,
// 0xC0000005: "access violation reading location"
if (free.compare_exchange_strong(element, element->next())) {
data = element->data();
new (data) _type_(std::forward<TArgs>(a)...);
return data;
}
}
else {
// if free is empty, grow the list
AllocBlock();
}
}
};
// Destroys the element and return its memory to the free list
void Free(_type_* element) {
if (element == nullptr) {return;}
element->~_type_();
element_t* t = (element_t*)(element);
while (true)
if (free.compare_exchange_strong(t->next() = free.load(), t))
return;
};
concurrency::concurrent_queue<block_t*> blocks;
std::atomic<element_t*> free;
};
Running in a single thread this implementation works correctly, however when testing in a multi-threaded application, this code will quickly crash at the compare-and-swap operation of the Alloc call, with exception:
0xC0000005: "access violation reading location".
I cannot tell by inspection what I am doing incorrectly that would cause this exception to be thrown. Any advice or input?
free
attempts to be a lock-free stack of addresses, but it falls short in ABA protection. Thread can update the free
pointer while another thread is in middle of storing (a stale) next pointer:
A: while (!free.compare_exchange_strong(block->elements[0].next() = free.load(), &block->elements[BlockSize-1])) {}
A: x = load(free) B: y = load(free)
store(next, x)
<thread A goes to lunch>
B: reorders top two items in the stack.
A: x is now stale.
CAS(free, x, addr)
You can use a 16-byte CAS and a counter that you increment on each attempted CAS. However, due to the standard library ABI, GCC/Clang can refuse to emit the CMPXCHG16B
on x86-64. So static_assert(std::atomic<bytes16>::is_always_lock_free);
would error, and you would get bad performance if you ignored this.
The last (portable-and-performant, but limited count of threads) option I can think of is to align the memory addresses to say 32-byte boundaries and use the 5 low zero bits of the address as an ABA counter. Non-portable, (but common) way is to use the high 8 or 16-bits of the virtual address for the ABA-tag.
Edit: I'd like add a note about the ABA counter. It is enough that each thread stores an per-thread unique tag with the CAS. Such a per-thread constant tag is likely better than a incrementing counter since it cannot overflow, which can silently break an otherwise working lock-free algorithm.
Why is your element
buffer
storingnext
pointers anddata
values in the same memory (and why are you not using aunion
for that purpose)?
Space-compaction, not that this matters much here. The next pointer is only needed for items that are in the stack. (i.e. free). AllocBlock()
pushes a block of addresses at once. Alloc()
pops a single address, after which the succeeding thread owns that piece of memory. (Push/pop requires ABA protection.) Free()
then pushes an address onto the stack to be reused by Alloc()
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With