I just found this library, which provides a lock-free ring buffer that works much faster than channels: https://github.com/textnode/gringo (and it really is noticeably faster, especially with GOMAXPROCS > 1).
But the interesting part is the struct used to manage the queue state:
type Gringo struct {
    padding1           [8]uint64
    lastCommittedIndex uint64
    padding2           [8]uint64
    nextFreeIndex      uint64
    padding3           [8]uint64
    readerIndex        uint64
    padding4           [8]uint64
    contents           [queueSize]Payload
    padding5           [8]uint64
}
If I remove the "paddingX [8]uint64" fields it works about 20% slower. How can that be?
I would also appreciate an explanation of why this lock-free algorithm is so much faster than channels, even buffered ones.
Padding trades memory for processor performance. In a struct or union, data members are aligned according to the size of the largest member so that accesses do not pay an alignment penalty.
To align the data in memory, one or more empty bytes (addresses) are inserted, or left empty, between the memory addresses allocated to the other members when the structure is laid out. This concept is called structure padding.
For example, if a char is followed by a short int, the compiler inserts a padding byte after the char to ensure the short int has an address that is a multiple of 2 (i.e. is 2-byte aligned).
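You can observe this kind of alignment padding directly in Go with the unsafe package. A minimal sketch (the struct is my own example, not gringo's), on a typical 64-bit platform:

package main

import (
    "fmt"
    "unsafe"
)

// example has a 1-byte field followed by an 8-byte field, so the compiler
// must insert 7 padding bytes before `value` to keep it 8-byte aligned.
type example struct {
    flag  bool  // 1 byte
    value int64 // 8 bytes; must start at an 8-byte-aligned offset
}

func main() {
    var e example
    fmt.Println(unsafe.Offsetof(e.value)) // 8: seven padding bytes were inserted after flag
    fmt.Println(unsafe.Sizeof(e))         // 16: the total size includes that padding
}

That is alignment padding inserted by the compiler. The [8]uint64 fields in Gringo are a different, manual trick with the same name, explained below.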
Padding eliminates false sharing by putting each hot field on its own cache line. If two variables share a cache line, a read of an unmodified variable becomes as expensive as a read of a modified one whenever there is an intervening write to the other variable.
When a variable is read on multiple cores and not modified, the cache line is shared by the cores, which makes the reads very cheap. Before any core can write to any part of that cache line, it must invalidate the line on the other cores. If any core later reads from that line, it finds it invalidated and has to go back to sharing it. This causes painful extra cache-coherency traffic when one variable is frequently modified and another on the same line is frequently read.
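To make the effect concrete, here is a rough Go sketch (the names and the 64-byte cache-line assumption are mine, not gringo's) that contrasts two counters sharing a cache line with two counters separated by padding. With GOMAXPROCS > 1, the shared layout is typically measurably slower on the same workload:

package main

import (
    "fmt"
    "sync"
    "sync/atomic"
    "time"
)

// shared: both counters live on the same cache line, so every write to `a`
// invalidates the line that the reader of `b` keeps pulling back in.
type shared struct {
    a uint64
    b uint64
}

// padded: 56 bytes of padding push `b` onto a different cache line
// (assuming a 64-byte line).
type padded struct {
    a uint64
    _ [7]uint64
    b uint64
}

func run(writeA, readB *uint64) time.Duration {
    const iters = 20_000_000
    var wg sync.WaitGroup
    wg.Add(2)
    start := time.Now()
    go func() { // writer hammers writeA
        defer wg.Done()
        for i := 0; i < iters; i++ {
            atomic.AddUint64(writeA, 1)
        }
    }()
    go func() { // reader repeatedly loads readB
        defer wg.Done()
        var sum uint64
        for i := 0; i < iters; i++ {
            sum += atomic.LoadUint64(readB)
        }
        _ = sum
    }()
    wg.Wait()
    return time.Since(start)
}

func main() {
    var s shared
    var p padded
    fmt.Println("same cache line:", run(&s.a, &s.b))
    fmt.Println("padded:         ", run(&p.a, &p.b))
}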
It works faster because it does not require locks. The Disruptor, a Java implementation of the same idea, works really well and seems to be the inspiration for gringo. Its paper explains the cost of locks and how you can increase throughput without them.
As for the padding, the paper also hints at some of the reasons. Basically: processor caches. You gain tremendous performance by keeping the processor in its Level 1 cache as much as possible instead of going out to the outer caches or main memory. But this requires extra care, because the processor fully loads a cache line and must reload it (from memory or the L2/L3 caches) every time it is invalidated. With a concurrent data structure, as @David Schwartz said, false sharing forces the processor to reload its cache lines much more often: data elsewhere on the same line gets modified, which forces the whole line to be loaded again.
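For illustration, here is a stripped-down single-producer/single-consumer ring in Go in the same spirit. This is my own sketch, not gringo's actual code; the field names, sizes, and spin-wait strategy are assumptions. Each side owns one index and only reads the other side's index, so no lock is needed, and the padding keeps the two indices on separate cache lines so that updating one does not invalidate the other:

package main

import (
    "fmt"
    "runtime"
    "sync/atomic"
)

const ringSize = 1024 // must be a power of two
const ringMask = ringSize - 1

type ring struct {
    _     [8]uint64 // padding keeps the hot indices on separate cache lines
    write uint64    // next slot the producer will fill (owned by the producer)
    _     [8]uint64
    read  uint64 // next slot the consumer will take (owned by the consumer)
    _     [8]uint64
    slots [ringSize]uint64
}

func (r *ring) Put(v uint64) {
    w := atomic.LoadUint64(&r.write)
    // Spin while the buffer is full (the consumer has not caught up yet).
    for w-atomic.LoadUint64(&r.read) >= ringSize {
        runtime.Gosched()
    }
    r.slots[w&ringMask] = v
    atomic.StoreUint64(&r.write, w+1) // publish the slot to the consumer
}

func (r *ring) Get() uint64 {
    rd := atomic.LoadUint64(&r.read)
    // Spin while the buffer is empty.
    for atomic.LoadUint64(&r.write) == rd {
        runtime.Gosched()
    }
    v := r.slots[rd&ringMask]
    atomic.StoreUint64(&r.read, rd+1) // hand the slot back to the producer
    return v
}

func main() {
    r := &ring{}
    done := make(chan struct{})
    go func() {
        for i := uint64(0); i < 1_000_000; i++ {
            if got := r.Get(); got != i {
                panic("out of order")
            }
        }
        close(done)
    }()
    for i := uint64(0); i < 1_000_000; i++ {
        r.Put(i)
    }
    <-done
    fmt.Println("all values received in order")
}

A real implementation needs more than this (for example, supporting multiple producers or blocking instead of spinning), but it shows the core idea: no locks, just two atomically published indices kept out of each other's cache lines.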