Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Cache lines, false sharing and alignment

I wrote the following short C++ program to reproduce the false sharing effect as described by Herb Sutter:

Say, we want to perform a total amount of WORKLOAD integer operations and we want them to be equally distributed to a number (PARALLEL) of threads. For the purpose of this test, each thread will increment its own dedicated variable from an array of integers, so the process may be ideally parallelizable.

void thread_func(int* ptr)
{
    for (unsigned i = 0; i < WORKLOAD / PARALLEL; ++i)
    {
        (*ptr)++;
    }
}

int main()
{
    int arr[PARALLEL * PADDING];
    thread threads[PARALLEL];

    for (unsigned i = 0; i < PARALLEL; ++i)
    {
        threads[i] = thread(thread_func, &(arr[i * PADDING]));
    }
    for (auto& th : threads)
    {
        th.join();
    }
    return 0;
}

I think the idea is easy to grasp. If you set

#define PADDING 16

every thread will work on a separate cache line (assuming the length of a cache line to be 64 bytes). So the result will be linear increase of speedup until PARALLEL > # cores. If, on the other hand, PADDING is set to any value below 16, one should encounter severe contention, for at least two threads are now likely to operate on the same cache line, which however is protected by a built-in hardware mutex. We would expect our speedup not only to be sublinear in this case, but even to be always < 1, because of the invisible lock convoy.

Now, my first attempts nearly satisfied these expectations, yet the minimum value of PADDING needed to avoid false sharing was around 8 and not 16. I was quite puzzled for about half an hour until I came to the obvious conclusion, that there is no guarantee for my array to be aligned exactly to the beginning of a cache line inside main memory. The actual alignment may vary depending on many conditions, including the size of the array.

In this example, there is of course no need for us to have the array aligned in a special way, because we can just leave PADDING at 16 and everything works out fine. But one could imagine cases, where it does make a difference, whether a certain structure is aligned to a cache line or not. Hence, I added some lines of code to get some information about the actual alignment of my array.

int main()
{
    int arr[PARALLEL * 16];
    thread threads[PARALLEL];
    int offset = 0;

    while (reinterpret_cast<int>(&arr[offset]) % 64) ++offset;
    for (unsigned i = 0; i < PARALLEL; ++i)
    {
        threads[i] = thread(thread_func, &(arr[i * 16 + offset]));
    }
    for (auto& th : threads)
    {
        th.join();
    }
    return 0;
}

Despite this solution worked out fine for me in this case, I'm not sure if it would be a good approach in general. So here is my question:

Is there any common way to have objects in memory aligned to cache lines other than what I did in the above example?

(using g++ MinGW Win32 x86 v.4.8.1 posix dwarf rev3)

like image 726
Rene R. Avatar asked Aug 14 '13 15:08

Rene R.


People also ask

What is false cache line sharing?

With false sharing, a thread is forced to fetch a more recent copy of a cache line from memory, even though the variable it is attempting to access has not been modified.

Are cache lines aligned?

Typically a cache line is 32 bytes long and it is aligned to a 32 byte offset. First a block of memory, a memory line, is loaded into a cache line. This cost is a cache miss, the latency of memory. Then, after loading, bytes within a cache line can be referenced without penalty as long as it remains in the cache.

How do you prevent false sharing cache?

In general, false sharing can be reduced using the following techniques: Make use of private or threadprivate data as much as possible. Use the compiler's optimization features to eliminate memory loads and stores. Pad data structures so that each thread's data resides on a different cache line.

What type of cache misses are caused by false sharing?

Capacity misses occur when the cache size is not sufficient to hold data between references. Coherence misses are misses caused by the coherence protocol. Coherence misses can be divided into those caused by true sharing and those caused by false sharing.


2 Answers

You should be able to request the required alignment from the compiler:

alignas(64) int arr[PARALELL * PADDING]; // align the array to a 64 byte line
like image 177
David Rodríguez - dribeas Avatar answered Sep 18 '22 09:09

David Rodríguez - dribeas


gcc supports an aligned keyword: http://gcc.gnu.org/onlinedocs/gcc/Variable-Attributes.html

You probably want something like this:

int arr[PARALLEL * 16] __attribute__ ((aligned (8)));

This aligns arr to an eight-byte boundary.

Visual Studio has a similar feature, too: http://msdn.microsoft.com/en-us/library/83ythb65.aspx

like image 36
pattivacek Avatar answered Sep 20 '22 09:09

pattivacek