 

How to optimize an algorithm for a given multi-core architecture

I would like to know what techniques I should look into for optimizing a given algorithm for a given architecture. How do I improve performance with better caching? How do I reduce cache-coherency overhead, and what access patterns should I avoid in my algorithm/program so that cache coherency doesn't hurt my performance?

I understand a few standard techniques for reusing recently cached data in L1, but how would I use data in a shared cache (say L2) on a multi-core effectively, so that I avoid a main-memory access, which is even costlier?

Basically, I am interested in which data access patterns I should try to exploit or avoid for a better mapping to my given architecture, and which data structures I could use, in which scenarios, for which architectures (with different levels of private and shared cache) to improve performance. Thanks.

Shyam asked Mar 05 '15


1 Answer

What techniques should I look into for optimizing a given algorithm for a given architecture?

Micro-architectures vary, so learn the details of your specific processor. Intel provides good documentation in their optimization guide. If you are using an Intel processor you'll want to read sections 8.3 and 8.6:

8.3 OPTIMIZATION GUIDELINES

This section summarizes optimization guidelines for tuning multithreaded applications. Five areas are listed (in order of importance):

  • Thread synchronization
  • Bus utilization
  • Memory optimization
  • Front end optimization
  • Execution resource optimization

Practices associated with each area are listed in this section. Guidelines for each area are discussed in greater depth in sections that follow. Most of the coding recommendations improve performance scaling with processor cores and scaling due to HT Technology. Techniques that apply to only one environment are noted.

8.6 MEMORY OPTIMIZATION

Efficient operation of caches is a critical aspect of memory optimization. Efficient operation of caches needs to address the following:

  • Cache blocking
  • Shared memory optimization
  • Eliminating 64-KByte aliased data accesses
  • Preventing excessive evictions in first-level cache
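
The first item on that list, cache blocking (also called loop tiling), restructures a loop nest so that each tile of data is fully reused while it is still resident in cache. Here is a minimal sketch of the idea; it is not taken from the Intel manual, and the matrix transpose and the tile size of 64 are illustrative assumptions:

```cpp
#include <cstddef>

// Naive transpose: the writes to dst stride through memory column-wise,
// so each write touches a different cache line and evicts useful data.
void transpose_naive(const double* src, double* dst, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            dst[j * n + i] = src[i * n + j];
}

// Blocked (tiled) transpose: process B x B tiles so that both the source
// rows and the destination rows of a tile stay cache-resident until the
// tile is finished.
void transpose_blocked(const double* src, double* dst, std::size_t n) {
    const std::size_t B = 64;  // illustrative tile size, tune per machine
    for (std::size_t ii = 0; ii < n; ii += B)
        for (std::size_t jj = 0; jj < n; jj += B)
            for (std::size_t i = ii; i < ii + B && i < n; ++i)
                for (std::size_t j = jj; j < jj + B && j < n; ++j)
                    dst[j * n + i] = src[i * n + j];
}
```

The best tile size depends on the cache sizes of the target processor, so measure rather than guess.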

Which data access patterns should I try to exploit or avoid for a better mapping to my given architecture?

Exploit

When a cache is full and an access misses, the cache must evict something to make room for the new data/code; what is evicted is usually chosen by an approximation of least-recently used (LRU). If possible, your code should have strong locality of reference:

  • Try to pack data that is used close together in time in the algorithm so that it is also close together in space (address).
  • Pack data tightly; don't use a 64-bit integer when a 32-bit integer will do, for example.
  • Sometimes the alignment of an "object" (related data) relative to a cache line matters. For example, if there is an array of 64-byte objects that are accessed randomly, then aligning them on a 64-byte boundary improves cache efficiency by not bringing in data that is not used. If an object isn't aligned, every object touched brings in two cache lines even though only 64 bytes are needed, so 50% of the data transferred isn't used (assuming 64-byte cache lines).
  • As @PaulA.Clayton pointed out in the comments, pre-fetching data is very important, as it hides part or all of the memory latency: "Also, exploiting stride-based hardware prefetching can be quite beneficial. (Software prefetching can also be useful in some cases.) Getting pointers early helps increase memory-level parallelism."
  • To help the hardware pre-fetcher and to increase the utilization of the data brought into the cache, pay careful attention to how matrices and other large structures are stored and accessed; see the Wikipedia article on row-major order. A sketch contrasting the two traversal orders follows this list.
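
As a minimal sketch of the traversal-order and alignment points above (the array size and struct layout are illustrative assumptions):

```cpp
#include <cstddef>

constexpr std::size_t N = 1024;

// Cache-friendly: the inner loop walks consecutive addresses, so each
// cache line brought in is fully consumed and the hardware stride
// prefetcher can stay ahead of the loop.
double sum_row_major(const double (&a)[N][N]) {
    double s = 0.0;
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t j = 0; j < N; ++j)
            s += a[i][j];
    return s;
}

// Cache-hostile: the inner loop strides N * sizeof(double) bytes, so
// every access touches a new cache line, and most of each line is
// evicted before its remaining elements are ever used.
double sum_column_major(const double (&a)[N][N]) {
    double s = 0.0;
    for (std::size_t j = 0; j < N; ++j)
        for (std::size_t i = 0; i < N; ++i)
            s += a[i][j];
    return s;
}

// A 64-byte object aligned to a 64-byte boundary: randomly accessing an
// array of these touches exactly one cache line per object.
struct alignas(64) Record {
    int   id;
    float values[15];  // pads the struct to exactly 64 bytes
};
static_assert(sizeof(Record) == 64, "one cache line per Record");
```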

Avoid

  • Data that you don't use often shouldn't be close to data that you use frequently
  • Avoid false sharing: if two or more threads access the same cache line but are not sharing the same data within the cache line, and at least one of them is a writer, the cache-coherency protocol imposes unnecessary traffic and latency. A sketch follows this list.
  • Try not to use new data until you are done with the older data
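
A minimal sketch of the false-sharing point (C++11 threads; the iteration count and padding are illustrative assumptions):

```cpp
#include <atomic>
#include <thread>

// Two counters in adjacent bytes: they share one cache line, so two
// threads writing them ping-pong that line between cores (false sharing).
struct SharedLine {
    std::atomic<long> a{0};
    std::atomic<long> b{0};
};

// The same counters padded onto separate 64-byte cache lines: each
// thread's writes stay in its own core's cache.
struct PaddedLines {
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
};

// Each thread increments its own counter; the data is never logically
// shared, yet the unpadded layout still pays coherency traffic.
template <typename Counters>
void hammer(Counters& c) {
    std::thread t1([&] { for (int i = 0; i < 10000000; ++i)
                             c.a.fetch_add(1, std::memory_order_relaxed); });
    std::thread t2([&] { for (int i = 0; i < 10000000; ++i)
                             c.b.fetch_add(1, std::memory_order_relaxed); });
    t1.join();
    t2.join();
}

int main() {
    SharedLine shared;   // typically measurably slower...
    PaddedLines padded;  // ...than this padded version
    hammer(shared);
    hammer(padded);
}
```

On most machines the padded version runs noticeably faster; the cache-miss counters mentioned under Measure below make the difference visible.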

Measure

As Andrei Alexandrescu said in this talk, when it comes to performance tuning the only intuition that is right is "I should measure this." Familiarize yourself with cache performance monitoring tools, for example:

  • perf
  • Cachegrind
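
For instance, `perf stat -e cache-references,cache-misses ./app` reports aggregate cache-miss counts for a run, while `valgrind --tool=cachegrind ./app` simulates the cache hierarchy and attributes misses to individual source lines.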
amdn answered Sep 30 '22