
Multiple threads and CPU cache


I am implementing an image filtering operation in C using multiple threads, making it as optimized as possible. I have one question though: if a memory location is accessed by thread-0, and concurrently the same location is accessed by thread-1, will thread-1 get it from the cache? This question stems from the possibility that these two threads could be running on two different cores of the CPU. So another way of putting this is: do all the cores share the same common cache memory?

Suppose I have a memory layout like the following:

int output[100];

Assume there are 2 CPU cores, so I spawn two threads to work concurrently. One scheme would be to divide the memory into two chunks, 0-49 and 50-99, and let each thread work on one chunk. Another would be to let thread-0 work on even indices (0, 2, 4, and so on) while the other thread works on odd indices (1, 3, 5, ...). This latter technique is easier to implement (especially for 3D data), but I am not sure whether it uses the cache efficiently.
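For concreteness, here is a minimal pthreads sketch of both schemes; filter() is a hypothetical stand-in for the actual per-pixel work:

#include <pthread.h>
#include <stdio.h>

#define N 100
#define NTHREADS 2

int output[N];

/* Hypothetical placeholder for the real per-pixel filtering work. */
static int filter(int i) { return i * 2; }

/* Scheme 1: contiguous chunks -- thread 0 handles 0..49, thread 1 handles 50..99. */
static void *work_chunked(void *arg)
{
    int tid = (int)(long)arg;
    int per = N / NTHREADS;
    for (int i = tid * per; i < (tid + 1) * per; i++)
        output[i] = filter(i);
    return NULL;
}

/* Scheme 2: interleaved -- thread 0 handles even indices, thread 1 odd ones. */
static void *work_interleaved(void *arg)
{
    int tid = (int)(long)arg;
    for (int i = tid; i < N; i += NTHREADS)
        output[i] = filter(i);
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, work_chunked, (void *)i);  /* or work_interleaved */
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("output[99] = %d\n", output[99]);
    return 0;
}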

asked Jan 26 '11 by Zahid Hossain




1 Answer

The answer to this question strongly depends upon the architecture and the cache level, along with where the threads are actually running.

For example, recent Intel multi-core CPUs have L1 caches that are per-core, and an L2 cache that is shared among the cores in the same CPU package; different CPU packages, however, each have their own L2 cache.

Even when your threads are running on two cores within the same package, though, if both threads access data within the same cache line, that cache line will bounce between the two L1 caches. This is very inefficient, and you should design your algorithm to avoid this situation.


A few comments have asked about how to go about avoiding this problem.

At heart, it's really not particularly complicated: you just want to avoid having two threads simultaneously access data that is located on the same cache line, where at least one thread is writing to that data. (As long as all the threads are only reading the data, there's no problem: on most architectures, read-only data can be present in multiple caches.)
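To illustrate (a sketch, assuming a 64-byte cache line): two per-thread counters placed next to each other will falsely share a line, while padding gives each one its own:

/* BAD: both counters live in the same 64-byte cache line, so two threads
   writing them concurrently will bounce the line between their L1 caches,
   even though neither ever touches the other's variable ("false sharing"). */
struct counters_bad {
    long a;   /* written only by thread 0 */
    long b;   /* written only by thread 1 */
};

/* BETTER: pad so each counter occupies its own cache line. */
struct counters_padded {
    long a;
    char pad[64 - sizeof(long)];
    long b;
};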

To do this, you need to know the cache line size. This varies by architecture, but currently most x86 and x86-64 family chips use a 64-byte cache line (consult the architecture manual for other architectures). You will also need to know the size of your data structures.

If you ask your compiler to align the shared data structure of interest to a 64-byte boundary (for example, your array output), then you know that it will start at the start of a cache line, and you can calculate where the subsequent cache line boundaries are. If your int is 4 bytes, each cache line will contain exactly 8 int values. As long as the array starts on a cache line boundary, output[0] through output[7] will be on one cache line, and output[8] through output[15] on the next. In this case, you would design your algorithm so that each thread works on a block of adjacent int values whose size is a multiple of 8.
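A sketch of that layout (GCC/Clang alignment syntax; thread_range() is a hypothetical helper that hands out blocks of whole cache lines):

#define CACHELINE 64
#define INTS_PER_LINE ((int)(CACHELINE / sizeof(int)))   /* 8 when int is 4 bytes */

/* Align the array so output[0] starts exactly on a cache line boundary
   (C11 _Alignas(CACHELINE) is an alternative spelling). */
__attribute__((aligned(CACHELINE))) int output[100];

/* Hypothetical helper: give thread 'tid' a block [*begin, *end) whose
   boundaries fall on cache line boundaries, so no line is written by
   two threads. */
void thread_range(int tid, int nthreads, int n, int *begin, int *end)
{
    int lines = (n + INTS_PER_LINE - 1) / INTS_PER_LINE;  /* total cache lines */
    int per   = (lines + nthreads - 1) / nthreads;        /* lines per thread */
    *begin = tid * per * INTS_PER_LINE;
    *end   = (tid + 1) * per * INTS_PER_LINE;
    if (*begin > n) *begin = n;
    if (*end > n) *end = n;
}

With 2 threads and n = 100, this gives thread 0 indices 0..55 and thread 1 indices 56..99; the split at index 56 falls exactly on a cache line boundary.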

If you are storing complicated struct types rather than plain int, the pahole utility will be of use. It analyses the struct types in your compiled binary and shows you their layout (including padding) and total size. You can then adjust your structs using this output - for example, you may want to manually add padding so that your struct's size is a multiple of the cache line size.
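For instance, once pahole has told you a struct's real size, you can pad it out to a whole cache line and guard that assumption at compile time (a sketch, assuming a 64-byte line):

struct pixel_accum {
    float r, g, b;        /* 12 bytes */
    int   count;          /*  4 bytes -> 16 bytes so far */
    char  pad[64 - 16];   /* manual padding up to one full cache line */
};

/* Fail the build if the padding calculation ever goes stale (C11). */
_Static_assert(sizeof(struct pixel_accum) % 64 == 0,
               "pixel_accum must be a multiple of the cache line size");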

On POSIX systems, the posix_memalign() function is useful for allocating a block of memory with a specified alignment.
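For example, to get the question's 100-int array on a 64-byte boundary from the heap (a minimal sketch):

#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>
#include <stdio.h>

int main(void)
{
    void *p = NULL;
    /* Request 100 ints starting exactly on a 64-byte boundary. */
    int err = posix_memalign(&p, 64, 100 * sizeof(int));
    if (err != 0) {
        fprintf(stderr, "posix_memalign failed: error %d\n", err);
        return 1;
    }
    int *output = p;
    output[0] = 42;
    free(p);   /* memory from posix_memalign is released with free() */
    return 0;
}

Note that posix_memalign() reports failure through its return value rather than through errno.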

answered Sep 28 '22 by caf