 

Why struct with padding fields works faster

I just found this library, which provides a lock-free ring that works way faster than channels: https://github.com/textnode/gringo (it is especially fast with GOMAXPROCS > 1).

But the interesting part is the struct that manages the queue state:

type Gringo struct {
    padding1 [8]uint64
    lastCommittedIndex uint64
    padding2 [8]uint64
    nextFreeIndex uint64
    padding3 [8]uint64
    readerIndex uint64
    padding4 [8]uint64
    contents [queueSize]Payload
    padding5 [8]uint64
}

If I remove the "paddingX [8]uint64" fields, it works about 20% slower. How can that be?

I'd also appreciate it if someone could explain why this lock-free algorithm is much faster than channels, even buffered ones.

asked Oct 16 '13 by Leonid Bugaev



2 Answers

Padding eliminates false sharing by putting each field on its own cache line. If two variables share a cache line, a read of an unmodified variable will be as expensive as a read of a modified variable whenever there's an intervening write to the other variable.

When a variable is read on multiple cores and not modified, the cache line is shared by the cores, which makes the reads very cheap. But before any core can write to any part of that cache line, it must invalidate the line on the other cores. If any core later reads from that line, it will find it invalidated and have to go back to sharing it. This causes painful extra cache-coherency traffic when one variable is frequently modified and another on the same line is frequently read.
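To see this effect in isolation, here is a minimal sketch (not code from gringo; all names are illustrative) that times two goroutines hammering adjacent counters versus padded ones. On a multi-core machine the padded case is typically several times faster:

package main

import (
    "fmt"
    "sync"
    "sync/atomic"
    "time"
)

// Two counters in the same struct: without padding they very likely
// share one 64-byte cache line.
type sharedCounters struct {
    a uint64
    b uint64 // on the same cache line as a
}

// The same counters separated by a spacer, mirroring gringo's paddingX
// fields, so each lands on its own cache line.
type paddedCounters struct {
    a uint64
    _ [7]uint64 // padding: pushes b past the end of a's cache line
    b uint64
}

func bench(name string, a, b *uint64) {
    const iters = 50000000
    var wg sync.WaitGroup
    wg.Add(2)
    start := time.Now()
    go func() { // goroutine 1: writes only a
        defer wg.Done()
        for i := 0; i < iters; i++ {
            atomic.AddUint64(a, 1)
        }
    }()
    go func() { // goroutine 2: writes only b
        defer wg.Done()
        for i := 0; i < iters; i++ {
            atomic.AddUint64(b, 1)
        }
    }()
    wg.Wait()
    fmt.Printf("%-8s %v\n", name, time.Since(start))
}

func main() {
    var s sharedCounters
    var p paddedCounters
    bench("shared:", &s.a, &s.b)
    bench("padded:", &p.a, &p.b)
}

Even though each goroutine touches only its own counter, in the shared case every write invalidates the line in the other core's cache, so both cores keep reloading it. Note that Go doesn't guarantee the struct itself starts on a cache-line boundary, so padding makes sharing unlikely rather than impossible.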

answered Oct 16 '22 by David Schwartz


It works faster because it does not require locks. The Disruptor, a Java implementation of the same idea, works really well and seems to be the inspiration for gringo. Its authors explain the cost of locks and how to increase throughput in their technical paper.

As for the padding, the paper also hints at some of the reasons. Basically: processor caches. This paper explains it well. You can gain tremendous performance by making the processor hit its Level 1 cache instead of going out to main memory or its outer caches as often as possible. But that requires extra care, because the processor will reload a cache line (from memory or the L2/L3 caches) every time it is invalidated. In the case of a concurrent data structure, as @David Schwartz said, false sharing forces the processor to reload its cache lines much more often: unrelated data in the rest of the line may be modified by another core, forcing the whole line to be fetched again.
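For intuition about the lock-free part, here is a stripped-down sketch of a single-producer/single-consumer ring in the spirit of gringo and the Disruptor (this is not the library's actual code; Ring, Payload, and the method names are illustrative). Each side owns one index, so no locks are needed: atomic loads/stores plus spinning are enough.

package main

import (
    "fmt"
    "runtime"
    "sync/atomic"
)

const queueSize = 1024 // a power of two, so idx % queueSize compiles to a cheap mask

type Payload uint64

type Ring struct {
    _        [8]uint64 // padding: keep writeIdx on its own cache line
    writeIdx uint64    // owned (written) by the producer only
    _        [8]uint64
    readIdx  uint64 // owned (written) by the consumer only
    _        [8]uint64
    contents [queueSize]Payload
}

// Write spins until a slot is free, then publishes the value.
func (r *Ring) Write(v Payload) {
    w := atomic.LoadUint64(&r.writeIdx)
    // Full when the producer is a whole ring ahead of the consumer.
    for w-atomic.LoadUint64(&r.readIdx) >= queueSize {
        runtime.Gosched() // let the consumer run
    }
    r.contents[w%queueSize] = v
    atomic.StoreUint64(&r.writeIdx, w+1) // publish the slot
}

// Read spins until a value is available, then consumes it.
func (r *Ring) Read() Payload {
    rd := atomic.LoadUint64(&r.readIdx)
    for atomic.LoadUint64(&r.writeIdx) == rd {
        runtime.Gosched() // empty: wait for the producer
    }
    v := r.contents[rd%queueSize]
    atomic.StoreUint64(&r.readIdx, rd+1)
    return v
}

func main() {
    r := &Ring{}
    go func() {
        for i := 0; i < 10; i++ {
            r.Write(Payload(i))
        }
    }()
    for i := 0; i < 10; i++ {
        fmt.Println(r.Read())
    }
}

A channel send, by contrast, acquires the channel's internal mutex and may park and reschedule goroutines; here the hot path is one atomic load, one plain array write, and one atomic store.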

answered Oct 16 '22 by val