Cache lines, false sharing and alignment

Q: What is false cache line sharing?

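False sharing occurs when threads on different cores access different variables that happen to lie on the same cache line. When one thread writes its variable, the other thread is forced to fetch a more recent copy of the whole cache line, even though the variable it is attempting to access has not been modified.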

Q: Are cache lines aligned?

Yes. A cache line has a fixed, hardware-dependent size (64 bytes on most current x86 processors, 32 bytes on some older designs) and always starts at an address that is a multiple of that size. On a cache miss, the whole aligned block of memory containing the requested address is loaded into a cache line, at the cost of the memory latency. After that, any byte within the line can be referenced without further penalty for as long as the line remains in the cache.
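To make the alignment rule concrete, here is a minimal sketch of the address arithmetic. It assumes a 64-byte line (the size the question below works with); kLineSize and the helper names are made up for illustration and are not part of any library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed line size for illustration; use 32 for hardware with 32-byte lines.
constexpr std::size_t kLineSize = 64;

// Start address of the cache line that contains addr.
constexpr std::uintptr_t line_base(std::uintptr_t addr)
{
    return addr - addr % kLineSize;
}

// Byte offset of addr inside its cache line.
constexpr std::size_t line_offset(std::uintptr_t addr)
{
    return static_cast<std::size_t>(addr % kLineSize);
}

// True if both addresses fall into the same cache line.
constexpr bool same_line(std::uintptr_t a, std::uintptr_t b)
{
    return line_base(a) == line_base(b);
}

// Worked values for a 64-byte line.
void check_line_math()
{
    assert(line_base(0x1234) == 0x1200);  // 0x1234 lies in the line starting at 0x1200
    assert(line_offset(0x1234) == 0x34);  // 52 bytes into that line
    assert(same_line(0x1200, 0x123F));    // first and last byte of one line
    assert(!same_line(0x123F, 0x1240));   // 0x1240 starts the next line
}
```

Two variables can only be falsely shared if same_line is true for their addresses.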

Q: How do you prevent false sharing?

In general, false sharing can be reduced using the following techniques: Make use of private or threadprivate data as much as possible. Use the compiler's optimization features to eliminate memory loads and stores. Pad data structures so that each thread's data resides on a different cache line.
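The padding technique could look roughly like the sketch below. It assumes a 64-byte cache line; PaddedCounter, kThreads and count_in_parallel are hypothetical names chosen for this example. Each counter is forced onto its own line with alignas, so no two threads write to the same line.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <thread>

constexpr std::size_t kLineSize = 64;  // assumed cache-line size
constexpr std::size_t kThreads = 4;    // hypothetical thread count

// alignas rounds both the alignment and the size up to one full line,
// so adjacent array elements never share a cache line.
struct alignas(kLineSize) PaddedCounter
{
    long value = 0;
};

static_assert(sizeof(PaddedCounter) == kLineSize, "one counter per line");
static_assert(alignof(PaddedCounter) == kLineSize, "counter starts on a line boundary");

// Every thread increments only its own padded counter; returns the sum.
long count_in_parallel(long iterations_per_thread)
{
    assert(iterations_per_thread >= 0);
    std::array<PaddedCounter, kThreads> counters{};
    std::array<std::thread, kThreads> workers;

    for (std::size_t i = 0; i < kThreads; ++i)
    {
        workers[i] = std::thread([&counters, i, iterations_per_thread] {
            for (long n = 0; n < iterations_per_thread; ++n)
            {
                ++counters[i].value;
            }
        });
    }
    for (auto& w : workers)
    {
        w.join();
    }

    long total = 0;
    for (const auto& c : counters)
    {
        total += c.value;
    }
    return total;
}
```

The price is memory: each long now occupies a full 64 bytes, which is usually acceptable for a handful of per-thread slots.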

Q: What type of cache misses are caused by false sharing?

False sharing causes coherence misses. Capacity misses, by contrast, occur when the cache is too small to hold data between references. Coherence misses are caused by the coherence protocol and can be divided into those caused by true sharing and those caused by false sharing.

Tags: c++, caching, multithreading, parallel-processing

I wrote the following short C++ program to reproduce the false sharing effect as described by Herb Sutter.

Say, we want to perform a total amount of WORKLOAD integer operations and we want them to be equally distributed to a number (PARALLEL) of threads. For the purpose of this test, each thread will increment its own dedicated variable from an array of integers, so the process may be ideally parallelizable.

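#include <thread>

using std::thread;

// WORKLOAD, PARALLEL and PADDING are preprocessor constants defined before this point.

// Each thread increments only the int that ptr points to.
void thread_func(int* ptr)
{
    for (unsigned i = 0; i < WORKLOAD / PARALLEL; ++i)
    {
        (*ptr)++;
    }
}

int main()
{
    int arr[PARALLEL * PADDING] = {};  // zero-initialized so the counters start from a defined value
    thread threads[PARALLEL];

    // Thread i works on arr[i * PADDING], so PADDING sets the distance between counters.
    for (unsigned i = 0; i < PARALLEL; ++i)
    {
        threads[i] = thread(thread_func, &(arr[i * PADDING]));
    }
    for (auto& th : threads)
    {
        th.join();
    }
    return 0;
}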

I think the idea is easy to grasp. If you set

#define PADDING 16

every thread will work on a separate cache line (assuming a cache line length of 64 bytes and a 4-byte int, so that 16 ints span exactly one line). So the result will be a linear increase in speedup until PARALLEL > # cores. If, on the other hand, PADDING is set to any value below 16, one should encounter severe contention, because at least two threads are now likely to operate on the same cache line, which the cache coherence protocol lets only one core modify at a time, much like a built-in hardware mutex. We would expect our speedup not only to be sublinear in this case, but even to be always < 1, because of the invisible lock convoy.

Now, my first attempts nearly satisfied these expectations, yet the minimum value of PADDING needed to avoid false sharing was around 8 and not 16. I was quite puzzled for about half an hour until I came to the obvious conclusion that there is no guarantee for my array to be aligned exactly to the beginning of a cache line in main memory. The actual alignment may vary depending on many conditions, including the size of the array.
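The dependence on the array's start address can be checked with plain arithmetic. The following is only a sketch under stated assumptions: 64-byte lines, 4-byte ints, and a made-up helper any_slots_share_line that takes the array's start address as a number. It is not a measurement of what actually happened on the machine above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kLineSize = 64;  // assumed cache-line size in bytes
constexpr std::size_t kIntSize = 4;    // assumed sizeof(int)

// Thread i uses the int at byte address base + i * padding * kIntSize.
// Returns true if any two of those ints land on the same cache line.
// The addresses increase with i, so comparing neighbours is sufficient.
bool any_slots_share_line(std::uintptr_t base, std::size_t padding, std::size_t parallel)
{
    assert(padding > 0);
    for (std::size_t i = 1; i < parallel; ++i)
    {
        const std::uintptr_t prev = base + (i - 1) * padding * kIntSize;
        const std::uintptr_t curr = base + i * padding * kIntSize;
        if (prev / kLineSize == curr / kLineSize)
        {
            return true;
        }
    }
    return false;
}
```

With padding 16 the slots are 64 bytes apart and never share a line, whatever the start address. With padding 8 the result depends on both the start address and the thread count: any_slots_share_line(0x1000, 8, 2) is true, while any_slots_share_line(0x1020, 8, 2) is false because the array starts in the middle of a line and the second slot falls on the next one. With three threads at 0x1020 the second and third slots share a line again.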

In this example, there is of course no need for us to have the array aligned in a special way, because we can just leave PADDING at 16 and everything works out fine. But one could imagine cases where it does make a difference whether a certain structure is aligned to a cache line or not. Hence, I added some lines of code to get some information about the actual alignment of my array.

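#include <cstdint>  // for std::uintptr_t

int main()
{
    int arr[PARALLEL * 16] = {};
    thread threads[PARALLEL];
    int offset = 0;

    // Advance offset to the first element that starts on a 64-byte boundary.
    // Casting to std::uintptr_t instead of int also compiles where a pointer is wider than an int.
    while (reinterpret_cast<std::uintptr_t>(&arr[offset]) % 64) ++offset;
    for (unsigned i = 0; i < PARALLEL; ++i)
    {
        threads[i] = thread(thread_func, &(arr[i * 16 + offset]));
    }
    for (auto& th : threads)
    {
        th.join();
    }
    return 0;
}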

Although this solution worked fine for me in this case, I'm not sure it would be a good approach in general. So here is my question:

Is there any common way to have objects in memory aligned to cache lines other than what I did in the above example?

(using g++ MinGW Win32 x86 v.4.8.1 posix dwarf rev3)

asked Aug 14 '13 15:08

Rene R.

2 Answers

You should be able to request the required alignment from the compiler:

alignas(64) int arr[PARALLEL * PADDING]; // align the array to a 64 byte line
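A small way to check that the request is honoured is to test the addresses at run time. This is a sketch only: kParallel and kPadding are hypothetical stand-ins for the PARALLEL and PADDING macros from the question, the helper names are invented, and the 64-byte line size is an assumption carried over from the question.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kLineSize = 64;  // assumed cache-line size
constexpr std::size_t kParallel = 4;   // hypothetical stand-in for PARALLEL
constexpr std::size_t kPadding = 16;   // 16 ints, i.e. 64 bytes with a 4-byte int

// True if p is a multiple of the assumed line size.
bool is_line_aligned(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % kLineSize == 0;
}

// Declares the array as in the answer and checks every thread's slot.
bool all_slots_line_aligned()
{
    alignas(kLineSize) int arr[kParallel * kPadding] = {};

    for (std::size_t i = 0; i < kParallel; ++i)
    {
        // With an aligned start and a stride of one full line,
        // each slot should begin exactly on a line boundary.
        if (!is_line_aligned(&arr[i * kPadding]))
        {
            return false;
        }
    }
    return true;
}
```

Note that the stride check relies on kPadding * sizeof(int) being a multiple of the line size; on a platform where int is not 4 bytes, the padding would have to be adjusted.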

answered Sep 18 '22 09:09

David Rodríguez - dribeas

gcc supports an aligned variable attribute: http://gcc.gnu.org/onlinedocs/gcc/Variable-Attributes.html

You probably want something like this:

int arr[PARALLEL * 16] __attribute__ ((aligned (8)));

This aligns arr to an eight-byte boundary. For the 64-byte cache lines assumed in the question, the argument would be 64 instead of 8.

Visual Studio has a similar feature, too: http://msdn.microsoft.com/en-us/library/83ythb65.aspx
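If both compilers need to be supported without alignas, the two vendor extensions can be hidden behind a macro. The sketch below is an assumption-laden illustration: CACHE_ALIGNED and g_counters are made-up names, the value 64 is the cache-line size assumed in the question, and only the GCC attribute syntax and MSVC's __declspec(align(#)) are taken as given.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical macro name; expands to the compiler-specific alignment request.
#if defined(_MSC_VER)
#  define CACHE_ALIGNED __declspec(align(64))
#else
#  define CACHE_ALIGNED __attribute__((aligned(64)))
#endif

// Room for 4 threads with a stride of 16 ints each.
CACHE_ALIGNED static int g_counters[4 * 16];

// True if the array starts on a 64-byte boundary.
bool counters_are_line_aligned()
{
    return reinterpret_cast<std::uintptr_t>(&g_counters[0]) % 64 == 0;
}

void check_alignment()
{
    assert(counters_are_line_aligned());
}
```

Where C++11 is available, alignas is the portable spelling and makes such a macro unnecessary.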

answered Sep 20 '22 09:09

pattivacek
