
Multiple threads and CPU cache


I am implementing an image filtering operation in C using multiple threads, making it as optimized as possible. I have one question though: if a memory location is accessed by thread-0, and concurrently the same location is accessed by thread-1, will thread-1 get it from the cache? This question stems from the possibility that these two threads could be running on two different cores of the CPU. So another way of putting this is: do all the cores share the same common cache memory?

Suppose I have a memory layout like the following:

int output[100];

Assume there are 2 CPU cores, so I spawn two threads to work concurrently. One scheme would be to divide the memory into two chunks, 0-49 and 50-99, and let each thread work on one chunk. Another would be to let thread-0 work on even indices (0, 2, 4, and so on) while the other thread works on odd indices (1, 3, 5, ...). This latter technique is easier to implement (especially for 3D data), but I am not sure whether it uses the cache efficiently.
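For concreteness, here is a minimal pthreads sketch of both schemes; filter() is a hypothetical stand-in for the actual per-pixel work:

#include <pthread.h>
#include <stdio.h>

#define N 100
#define NTHREADS 2

int output[N];

/* Hypothetical placeholder for the real per-pixel filtering work. */
static int filter(int i) { return i * 2; }

/* Scheme 1: contiguous chunks -- thread 0 handles 0..49, thread 1 handles 50..99. */
static void *work_chunked(void *arg)
{
    int tid = (int)(long)arg;
    int per = N / NTHREADS;
    for (int i = tid * per; i < (tid + 1) * per; i++)
        output[i] = filter(i);
    return NULL;
}

/* Scheme 2: interleaved -- thread 0 handles even indices, thread 1 odd ones. */
static void *work_interleaved(void *arg)
{
    int tid = (int)(long)arg;
    for (int i = tid; i < N; i += NTHREADS)
        output[i] = filter(i);
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, work_chunked, (void *)i);  /* or work_interleaved */
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("output[99] = %d\n", output[99]);
    return 0;
}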

asked Jan 26 '11 by Zahid Hossain




1 Answer

The answer to this question strongly depends upon the architecture and the cache level, along with where the threads are actually running.

For example, recent Intel multi-core CPUs have L1 caches that are per-core, and an L2 cache that is shared among the cores in the same CPU package; different CPU packages, however, each have their own L2 cache.

Even when your threads are running on two cores within the same package, though, if both threads access data within the same cache line, that cache line will bounce between the two L1 caches. This is very inefficient, and you should design your algorithm to avoid this situation.


A few comments have asked about how to go about avoiding this problem.

At heart, it's really not particularly complicated: you just want to avoid having two threads simultaneously access data that is located on the same cache line, where at least one thread is writing to that data. (As long as all the threads are only reading the data, there's no problem: on most architectures, read-only data can be present in multiple caches.)
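To illustrate (a sketch, assuming a 64-byte cache line): two per-thread counters placed next to each other will falsely share a line, while padding gives each one its own:

/* BAD: both counters live in the same 64-byte cache line, so two threads
   writing them concurrently will bounce the line between their L1 caches,
   even though neither ever touches the other's variable ("false sharing"). */
struct counters_bad {
    long a;   /* written only by thread 0 */
    long b;   /* written only by thread 1 */
};

/* BETTER: pad so each counter occupies its own cache line. */
struct counters_padded {
    long a;
    char pad[64 - sizeof(long)];
    long b;
};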

To do this, you need to know the cache line size. This varies by architecture, but currently most x86 and x86-64 family chips use a 64-byte cache line (consult the architecture manual for other architectures). You will also need to know the size of your data structures.

If you ask your compiler to align the shared data structure of interest to a 64-byte boundary (for example, your array output), then you know that it will start at the start of a cache line, and you can calculate where the subsequent cache line boundaries are. If your int is 4 bytes, each cache line will contain exactly 8 int values. As long as the array starts on a cache line boundary, output[0] through output[7] will be on one cache line, and output[8] through output[15] on the next. In this case, you would design your algorithm so that each thread works on a block of adjacent int values whose size is a multiple of 8.
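A sketch of that layout (GCC/Clang alignment syntax; thread_range() is a hypothetical helper that hands out blocks of whole cache lines):

#define CACHELINE 64
#define INTS_PER_LINE ((int)(CACHELINE / sizeof(int)))   /* 8 when int is 4 bytes */

/* Align the array so output[0] starts exactly on a cache line boundary
   (C11 _Alignas(CACHELINE) is an alternative spelling). */
__attribute__((aligned(CACHELINE))) int output[100];

/* Hypothetical helper: give thread 'tid' a block [*begin, *end) whose
   boundaries fall on cache line boundaries, so no line is written by
   two threads. */
void thread_range(int tid, int nthreads, int n, int *begin, int *end)
{
    int lines = (n + INTS_PER_LINE - 1) / INTS_PER_LINE;  /* total cache lines */
    int per   = (lines + nthreads - 1) / nthreads;        /* lines per thread */
    *begin = tid * per * INTS_PER_LINE;
    *end   = (tid + 1) * per * INTS_PER_LINE;
    if (*begin > n) *begin = n;
    if (*end > n) *end = n;
}

With 2 threads and n = 100, this gives thread 0 indices 0..55 and thread 1 indices 56..99; the split at index 56 falls exactly on a cache line boundary.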

If you are storing complicated struct types rather than plain int, the pahole utility will be of use. It analyses the struct types in your compiled binary and shows you their layout (including padding) and total size. You can then adjust your structs using this output - for example, you may want to manually add padding so that your struct's size is a multiple of the cache line size.
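For instance, once pahole has told you a struct's real size, you can pad it out to a whole cache line and guard that assumption at compile time (a sketch, assuming a 64-byte line):

struct pixel_accum {
    float r, g, b;        /* 12 bytes */
    int   count;          /*  4 bytes -> 16 bytes so far */
    char  pad[64 - 16];   /* manual padding up to one full cache line */
};

/* Fail the build if the padding calculation ever goes stale (C11). */
_Static_assert(sizeof(struct pixel_accum) % 64 == 0,
               "pixel_accum must be a multiple of the cache line size");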

On POSIX systems, the posix_memalign() function is useful for allocating a block of memory with a specified alignment.
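For example, to get the question's 100-int array on a 64-byte boundary from the heap (a minimal sketch):

#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>
#include <stdio.h>

int main(void)
{
    void *p = NULL;
    /* Request 100 ints starting exactly on a 64-byte boundary. */
    int err = posix_memalign(&p, 64, 100 * sizeof(int));
    if (err != 0) {
        fprintf(stderr, "posix_memalign failed: error %d\n", err);
        return 1;
    }
    int *output = p;
    output[0] = 42;
    free(p);   /* memory from posix_memalign is released with free() */
    return 0;
}

Note that posix_memalign() reports failure through its return value rather than through errno.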

answered Sep 28 '22 by caf