 

How to avoid cache line invalidation from multiple threads writing to a shared array?

Context of the problem:

I am writing code that creates 32 threads and sets the affinity of each one to a different core of my 32-core, multi-processor system.

Each thread simply executes the RDTSCP instruction and stores the value in a shared array at a non-overlapping position. This is the shared array:

uint64_t rdtscp_values[32];

So every thread writes to the array position corresponding to its core number.

Up to now, everything has been working properly, except that I know I may not be using the right data structure to avoid cache line bouncing.

P.S.: I have already checked that my processor's cache line is 64 bytes wide.

Because I am using a simple uint64_t array, a single 64-byte cache line stores 8 positions of this array.
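For reference, here is a minimal sketch of this setup (simplified; it assumes Linux pthreads and GCC's __rdtscp intrinsic, so it is close to but not exactly my real code):

#define _GNU_SOURCE             // for pthread_setaffinity_np / CPU_SET
#include <pthread.h>
#include <sched.h>
#include <stdint.h>
#include <x86intrin.h>          // for __rdtscp

#define NTHREADS 32

uint64_t rdtscp_values[NTHREADS];   // the shared array in question

static void *worker(void *arg)
{
    unsigned core = (unsigned)(uintptr_t)arg;

    // Pin this thread to its core number
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    unsigned aux;                               // receives IA32_TSC_AUX
    rdtscp_values[core] = __rdtscp(&aux);       // the write I am worried about
    return NULL;
}

int main(void)
{
    pthread_t tids[NTHREADS];
    for (unsigned i = 0; i < NTHREADS; i++)
        pthread_create(&tids[i], NULL, worker, (void *)(uintptr_t)i);
    for (unsigned i = 0; i < NTHREADS; i++)
        pthread_join(tids[i], NULL);
    return 0;
}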

Question:

Given this simple array, although the threads write to different indexes, my understanding is that every write to this array will cause a cache invalidation for all the other threads. Is that correct?

How could I create a structure that is aligned to the cache line?

EDIT 1

My system is: 2x Intel Xeon E5-2670 2.30GHz (8 cores, 16 threads)

asked Oct 24 '17 by campescassiano

People also ask

How do you avoid false sharing?

In general, false sharing can be reduced using the following techniques: Make use of private or threadprivate data as much as possible. Use the compiler's optimization features to eliminate memory loads and stores. Pad data structures so that each thread's data resides on a different cache line.

What is false sharing in the context of multi threading?

Subsequent references to the same memory location, or to those around it, are satisfied out of the cache until the system determines it is necessary to maintain coherency between cache and memory. False sharing occurs when threads on different processors modify variables that reside on the same cache line.

How do I stop false sharing on OpenMP?

In OpenMP programs, false sharing often arises when several threads maintain their respective partial results in a vector indexed by thread rank. Replacing this with thread-local variables often helps, as the sketch below shows. Avoid writing to global data that is accessed from multiple threads, and align shared global data to cache-line boundaries.
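A sketch of both versions (illustrative only; the function names are made up):

#include <omp.h>

// Rank-indexed vector: neighbouring threads' accumulators share cache
// lines, so every += can ping-pong a line between cores.
double sum_false_sharing(const double *a, int n, int nthreads)
{
    double partial[64] = {0};           // assumes nthreads <= 64
    #pragma omp parallel num_threads(nthreads)
    {
        int rank = omp_get_thread_num();
        #pragma omp for
        for (int i = 0; i < n; i++)
            partial[rank] += a[i];      // adjacent slots, same cache line
    }
    double total = 0.0;
    for (int t = 0; t < nthreads; t++)
        total += partial[t];
    return total;
}

// Thread-local accumulator via a reduction: each thread keeps its sum
// private (typically in a register) and the results are combined once.
double sum_thread_local(const double *a, int n)
{
    double total = 0.0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; i++)
        total += a[i];
    return total;
}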

How would you describe the false sharing when it is likely to occur what should be done to minimize the false sharing problem Can this problem be completely eliminated?

This situation is called false sharing: cache coherency is maintained on a cache-line basis, not for individual elements, so when threads on different processors modify independent variables that reside on the same line there will be an increase in interconnect traffic and overhead. Also, while the cache-line update is in progress, access to the elements in the line is inhibited. Padding each thread's data onto its own cache line avoids it, at the cost of wasted cache footprint.


1 Answer

Yes you definitely want to avoid "false sharing" and cache-line ping-pong. But this probably doesn't make sense: if these memory locations are thread-private more often than they're collected by other threads, they should be stored with other per-thread data so you're not wasting cache footprint on 56 bytes of padding. See also Cache-friendly way to collect results from multiple threads. (There's no great answer; avoid designing a system that needs really fine-grained gathering of results if you can.)
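For illustration (a sketch with made-up member names), "stored with other per-thread data" could look like this:

#include <stdint.h>
#include <stdalign.h>

// Hypothetical per-thread state: the RDTSCP result lives next to other
// data the thread already owns, so the rest of the 64B line does useful
// work instead of being pure padding.
struct thread_state {
    alignas(64) uint64_t tsc;   // result occasionally collected by other threads
    uint64_t iterations;        // ...whatever else this thread touches often
    uint64_t scratch[6];        // fills the struct out to exactly one cache line
};

struct thread_state per_thread[32];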


But let's just assume for a minute that unused padding between slots for different threads is actually what you want.

Yes, you need the stride to be 64 bytes (1 cache line), but you don't actually need the 8B you're using to be at the start of each cache line. Thus, you don't need any extra alignment as long as the uint64_t objects are naturally-aligned (so they aren't split across a cache-line boundary).

It's fine if each thread is writing to the 3rd qword of its cache line instead of the 1st. OTOH, aligning to 64B makes sure nothing else is sharing a cache line with the first element, and it's easy so we might as well.


Static storage: aligning static storage is very easy in ISO C11 using alignas(), or with compiler-specific stuff.

With a struct, padding is implicit to make the size a multiple of the required alignment. Having one member with an alignment requirement implies that the whole struct requires at least that much alignment. The compiler takes care of this for you with static and automatic storage, but you have to use aligned_alloc or an alternative for over-aligned dynamic allocation.

#include <stdalign.h>   // for #define alignas _Alignas  for C++ compat
#include <stdint.h>     // for uint64_t

// compiler knows the padding is just padding
struct { alignas(64) uint64_t v; } rdtscp_values[32];

int foo(unsigned t) {
    rdtscp_values[t].v = 1;
    return sizeof(rdtscp_values[0]);  // yes, this is 64
}

Or with an array, as suggested by @Eric Postpischil:

alignas(64) // optional, stride will still be 64B without this.
uint64_t rdtscp_values_2d[32][8];  // 8 uint64_t per cache line

void bar(unsigned t) {
    rdtscp_values_2d[t][0] = 1;
}

alignas() is optional if you don't care about the whole thing being 64B aligned, just having 64B stride between elements you use. You could also use __attribute__((aligned(64))) in GNU C or C++, or __declspec(align(64)) for MSVC, using #ifdef to define an ALIGN macro that's portable across the major x86 compilers.
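For example, one possible spelling of such a macro (the name ALIGN64 is made up):

#if defined(_MSC_VER)
  #define ALIGN64 __declspec(align(64))
#elif defined(__GNUC__)
  #define ALIGN64 __attribute__((aligned(64)))
#else                           // fall back to C11
  #include <stdalign.h>
  #define ALIGN64 alignas(64)
#endif

ALIGN64 uint64_t rdtscp_values_2d[32][8];   // same layout as the version above

Prefix position works for all three spellings, so the macro can go right before the declaration.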


Either the struct or the 2D-array version produces the same asm. We can check the compiler output to verify that we got what we wanted. I put it up on the Godbolt compiler explorer. We get:

foo:   # and same for bar
    mov     eax, edi              # zero extend 32-bit to 64-bit
    shl     rax, 6                # *64 is the same as <<6
    mov     qword ptr [rax + rdtscp_values], 1    # store 1

    mov     eax, 64               # return value = 64 = sizeof(struct)
    ret

Both arrays are declared the same way, with the compiler requesting 64B alignment from the assembler/linker with the 3rd arg to .comm:

    .comm   rdtscp_values_2d,2048,64
    .comm   rdtscp_values,2048,64

Dynamic storage:

If the number of threads is not a compile-time constant, then you can use an aligned allocation function to get aligned dynamically-allocated memory (especially if you want to support a very high number of threads). See How to solve the 32-byte-alignment issue for AVX load/store operations?, but really just use C11 aligned_alloc. It's perfect for this, and returns a pointer that's compatible with free().

#include <stdlib.h>     // for aligned_alloc (C11)

struct { alignas(64) uint64_t v; } *dynamic_rdtscp_values;
void init(unsigned nthreads) {
    size_t sz = sizeof(dynamic_rdtscp_values[0]);
    // aligned_alloc takes (alignment, size); size is a multiple of the alignment here
    dynamic_rdtscp_values = aligned_alloc(sz, (size_t)nthreads * sz);
}

void baz(unsigned t) {
    dynamic_rdtscp_values[t].v = 1;
}


 baz:
    mov     rax, qword ptr [rip + dynamic_rdtscp_values]

    mov     ecx, edi            # same code as before to scale by 64 bytes
    shl     rcx, 6
    mov     qword ptr [rax + rcx], 1
    ret

The address of the array is no longer a link-time constant, so there's an extra level of indirection to access it. But the pointer is read-only after it's initialized, so it will stay shared in cache in each core and reloading it when needed is very cheap.


Footnote: in the i386 System V ABI, uint64_t only has 4-byte alignment inside structs by default (without alignas(8) or __attribute__((aligned(8)))), so if you put an int before a uint64_t and didn't align the whole struct, it would be possible to get cache-line splits. But compilers align it by 8 bytes whenever possible, so your struct-with-padding is still fine.
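To make the footnote concrete (an illustration; the struct names are made up):

#include <stdint.h>

// On the i386 SysV ABI this struct has 4-byte alignment, so 'stamp' sits
// at offset 4; depending on where an instance lands in memory, the 8-byte
// member can straddle a cache-line boundary.
struct risky {
    int id;                 // offset 0
    uint64_t stamp;         // offset 4 on i386 (offset 8 on x86-64)
};

// Forcing 8-byte alignment on the member rules the split out: a naturally
// aligned 8-byte object can never cross a 64-byte boundary.
struct safe {
    int id;
    _Alignas(8) uint64_t stamp;     // now always at an 8-byte boundary
};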

answered Oct 01 '22 by Peter Cordes