I just found this library, which provides a lock-free ring buffer that works much faster than channels: https://github.com/textnode/gringo (and it really is noticeably faster, especially with GOMAXPROCS > 1).
But the interesting part is the struct used to manage the queue state:
type Gringo struct {
    padding1           [8]uint64
    lastCommittedIndex uint64
    padding2           [8]uint64
    nextFreeIndex      uint64
    padding3           [8]uint64
    readerIndex        uint64
    padding4           [8]uint64
    contents           [queueSize]Payload
    padding5           [8]uint64
}
If I remove the "paddingX [8]uint64" fields it works about 20% slower. How can that be?
I would also appreciate an explanation of why this lock-free algorithm is so much faster than channels, even buffered ones.
Padding trades memory for processor performance. In a struct or union, data members are aligned according to the size of the largest member so that accesses do not pay an alignment penalty.
To align the data in memory, one or more empty bytes (addresses) are inserted, or left empty, between the memory addresses allocated to the other members when the structure is laid out. This concept is called structure padding.
For example, if a char is followed by a short int, the compiler inserts a padding byte after the char to ensure the short int has an address that is a multiple of 2 (i.e. is 2-byte aligned).
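You can observe this kind of alignment padding directly in Go with the unsafe package. A minimal sketch (the struct is my own example, not gringo's), on a typical 64-bit platform:

package main

import (
    "fmt"
    "unsafe"
)

// example has a 1-byte field followed by an 8-byte field, so the compiler
// must insert 7 padding bytes before `value` to keep it 8-byte aligned.
type example struct {
    flag  bool  // 1 byte
    value int64 // 8 bytes; must start at an 8-byte-aligned offset
}

func main() {
    var e example
    fmt.Println(unsafe.Offsetof(e.value)) // 8: seven padding bytes were inserted after flag
    fmt.Println(unsafe.Sizeof(e))         // 16: the total size includes that padding
}

That is alignment padding inserted by the compiler. The [8]uint64 fields in Gringo are a different, manual trick with the same name, explained below.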
Padding eliminates false sharing by putting each hot field on its own cache line. If two variables share a cache line, a read of an unmodified variable becomes as expensive as a read of a modified one whenever there is an intervening write to the other variable.
When a variable is read on multiple cores and not modified, the cache line is shared by the cores, which makes the reads very cheap. Before any core can write to any part of that cache line, it must invalidate the line on the other cores. If any core later reads from that line, it finds it invalidated and has to go back to sharing it. This causes painful extra cache-coherency traffic when one variable is frequently modified and another on the same line is frequently read.
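To make the effect concrete, here is a rough Go sketch (the names and the 64-byte cache-line assumption are mine, not gringo's) that contrasts two counters sharing a cache line with two counters separated by padding. With GOMAXPROCS > 1, the shared layout is typically measurably slower on the same workload:

package main

import (
    "fmt"
    "sync"
    "sync/atomic"
    "time"
)

// shared: both counters live on the same cache line, so every write to `a`
// invalidates the line that the reader of `b` keeps pulling back in.
type shared struct {
    a uint64
    b uint64
}

// padded: 56 bytes of padding push `b` onto a different cache line
// (assuming a 64-byte line).
type padded struct {
    a uint64
    _ [7]uint64
    b uint64
}

func run(writeA, readB *uint64) time.Duration {
    const iters = 20_000_000
    var wg sync.WaitGroup
    wg.Add(2)
    start := time.Now()
    go func() { // writer hammers writeA
        defer wg.Done()
        for i := 0; i < iters; i++ {
            atomic.AddUint64(writeA, 1)
        }
    }()
    go func() { // reader repeatedly loads readB
        defer wg.Done()
        var sum uint64
        for i := 0; i < iters; i++ {
            sum += atomic.LoadUint64(readB)
        }
        _ = sum
    }()
    wg.Wait()
    return time.Since(start)
}

func main() {
    var s shared
    var p padded
    fmt.Println("same cache line:", run(&s.a, &s.b))
    fmt.Println("padded:         ", run(&p.a, &p.b))
}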
It works faster because it does not require locks. The Disruptor, a Java implementation of the same idea, works really well and seems to be the inspiration for gringo. Its paper explains the cost of locks and how you can increase throughput without them.
As for the padding, the paper also hints at some of the reasons. Basically: processor caches. You gain tremendous performance by keeping the processor in its Level 1 cache as much as possible instead of going out to the outer caches or main memory. But this requires extra care, because the processor fully loads a cache line and must reload it (from memory or the L2/L3 caches) every time it is invalidated. With a concurrent data structure, as @David Schwartz said, false sharing forces the processor to reload its cache lines much more often: data elsewhere on the same line gets modified, which forces the whole line to be loaded again.
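For illustration, here is a stripped-down single-producer/single-consumer ring in Go in the same spirit. This is my own sketch, not gringo's actual code; the field names, sizes, and spin-wait strategy are assumptions. Each side owns one index and only reads the other side's index, so no lock is needed, and the padding keeps the two indices on separate cache lines so that updating one does not invalidate the other:

package main

import (
    "fmt"
    "runtime"
    "sync/atomic"
)

const ringSize = 1024 // must be a power of two
const ringMask = ringSize - 1

type ring struct {
    _     [8]uint64 // padding keeps the hot indices on separate cache lines
    write uint64    // next slot the producer will fill (owned by the producer)
    _     [8]uint64
    read  uint64 // next slot the consumer will take (owned by the consumer)
    _     [8]uint64
    slots [ringSize]uint64
}

func (r *ring) Put(v uint64) {
    w := atomic.LoadUint64(&r.write)
    // Spin while the buffer is full (the consumer has not caught up yet).
    for w-atomic.LoadUint64(&r.read) >= ringSize {
        runtime.Gosched()
    }
    r.slots[w&ringMask] = v
    atomic.StoreUint64(&r.write, w+1) // publish the slot to the consumer
}

func (r *ring) Get() uint64 {
    rd := atomic.LoadUint64(&r.read)
    // Spin while the buffer is empty.
    for atomic.LoadUint64(&r.write) == rd {
        runtime.Gosched()
    }
    v := r.slots[rd&ringMask]
    atomic.StoreUint64(&r.read, rd+1) // hand the slot back to the producer
    return v
}

func main() {
    r := &ring{}
    done := make(chan struct{})
    go func() {
        for i := uint64(0); i < 1_000_000; i++ {
            if got := r.Get(); got != i {
                panic("out of order")
            }
        }
        close(done)
    }()
    for i := uint64(0); i < 1_000_000; i++ {
        r.Put(i)
    }
    <-done
    fmt.Println("all values received in order")
}

A real implementation needs more than this (for example, supporting multiple producers or blocking instead of spinning), but it shows the core idea: no locks, just two atomically published indices kept out of each other's cache lines.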