How to optimize an algorithm for a given multi-core architecture

Tags:

I would like to know what techniques I should look up-to for optimizing a given algorithm for a given architecture. How do I improve performance using better caching. How do I reduce cache coherency or what access patterns should I avoid in my algorithm/program so that cache coherency doesn't impact my performance?

I understand a few standard techniques for using the recently cached data in L1 but how would I use data in a shared cache(say L2) on a multi-core effectively thereby I avoid a main-memory access which is even more costlier?

Basically, I am interested in what data access patterns I should try to exploit or avoid for a better mapping to my given architecture. What data structure I could use, in what scenarios for what architectures(with different levels of private cache and shared cache) to improve performance. Thanks.

910

asked Mar 05 '15 04:03

Shyam

1 Answers

What techniques I should look up-to for optimizing a given algorithm for a given architecture?

Micro-architectures vary, so learn the details of your specific processor. Intel provides good documentation in their optimization guide. If you are using an Intel processor you'll want to read sections 8.3 and 8.6:

8.3 OPTIMIZATION GUIDELINES This section summarizes optimization guidelines for tuning multithreaded applications. Five areas are listed (in order of importance):

Thread synchronization
Bus utilization
Memory optimization
Front end optimization
Execution resource optimization

Practices associated with each area are listed in this section. Guidelines for each area are discussed in greater depth in sections that follow. Most of the coding recommendations improve performance scaling with processor cores; and scaling due to HT Technology. Techniques that apply to only one environment are noted.

8.6 MEMORY OPTIMIZATION

Efficient operation of caches is a critical aspect of memory optimization. Efficient operation of caches needs to address the following:

Cache blocking
Shared memory optimization
Eliminating 64-KByte aliased data accesses
Preventing excessive evictions in first-level cache

What data access patterns I should try to exploit or avoid for a better mapping to my given architecture?

Exploit

When caches are full and an access misses in the cache the cache must evict something to make room for the new data/code, what is evicted is usually based on an approximation of least-recently used (LRU). If possible then your code should have strong locality of reference:

Try to pack data that is used close in time in the algorithm such that it is close in space (address)
Pack data tightly, don't use a 64-bit integer when a 32-bit integer will do, for example
Sometimes the alignment of an "object" (related data) relative to a cache line matters. For example, if there is an array of objects each of 64-bytes and they are accessed randomly then aligning at a 64-byte boundary will improve cache efficiency by not bringing in data that is not used. If the object isn't aligned then every object touched brings in two cache lines, but only 64-bytes are needed, so 50% of data transferred isn't used (assumes cache lines are 64-bytes).
As @PaulA.Clayton pointed out in the comments, pre-fetching data is very important, as it hides part or all of the memory latency. "Also, exploiting stride-based hardware prefetching can be quite beneficial. (Software prefetching can also be useful in some cases.) Getting pointers early helps increase memory-level parallelism."
In order to facilitate the hardware pre-fetcher and to increase the utilization of the data that is brought into the cache pay careful attention to how matrices and other large structures are stored and accessed... see Wikipedia article on row-major order.

Avoid

Data that you don't use often shouldn't be close to data that you use frequently
Avoid false sharing. If two or more threads access the same cache line but are not sharing the same data within the cache line and at least one of them is a writer you have false sharing... there will be unnecessary burden and latency hit associated with cache coherency protocol.
Try not to use new data until you are done with the older data

Measure

As Andrei Alexandrescu said in this talk - when it comes to performance tuning the only intuition that is right is "I should measure this." Familiarize yourself with cache performance monitoring tools, for example:

perf
Cachegrind

162

answered Sep 30 '22 20:09

amdn

Related questions
                            
                                C# suspending all threads
                            
                                Swing Thread Safe Programming
                            
                                python threading.Event: add event params
                            
                                Thread-safe Settings
                            
                                Multi threaded single file writing in Java
                            
                                Waiting for Navigation Drawer to close before initializing Fragments with empty while-loop
                            
                                dispatch work from many threads to one synchronous thread
                            
                                Replacement for SynchronizationContext.Send() in Portable Class Libraries
                            
                                Why Thread.CurrentContext property and Thread.GetDomain() method?
                            
                                Why do I have worst performance on my spinlock implementation when I use non-cst memory model?
                            
                                Are all the web requests executed in parallel and handled asynchronously?
                            
                                Python process hanging due to open Paramiko ssh connections
                            
                                Sqlalchemy: Using decorator to provide thread-safe session for multiple functions
                            
                                Can a synchronized block/method be interrupted?
                            
                                Can you programmatically alter a queue's "dead letter" handling in a Java embedded broker?
                            
                                C++11 event loop with thread safe queue
                            
                                Does C# Parallel.ForEach use the same thread for iterations of the collection
                            
                                use Guava CharMatcher as a static fields in a class . Is CharMatcher thread safe?
                            
                                Easiest way to simulate the maximum CPU load?
                            
                                How to get ThreadStatic value of another thread?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How to optimize an algorithm for a given multi-core architecture

Tags:

optimization

caching

multithreading

multicore

computer-architecture

Shyam

People also ask

1 Answers

amdn

Recent Activity

Donate For Us