Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Should I prefer stride one memory access for either reading or writing?

It's well known that accessing memory in a stride one fashion is best for performance.

In situations where

  • I must access one region of memory for reading,
  • I must access another region for writing, and
  • I may only access one of the two regions in a stride one fashion,

should I prefer reading stride one or writing stride one?

One simple, concrete example is a BLAS-like copy-and-permute operation like y := P x. The permutation matrix P is defined entirely by some permutation vector q(i). It has a corresponding inverse permutation vector qinv(i). One could code the required loop as y[qinv(i)] = x[i] or as y[i]=x[q(i)] where the former reads from x stride one and the latter writes to y stride one.

Ideally one could always code both possibilities, profile them under representative conditions, and choose the faster version. Pretend you could only code one version-- which access pattern would you always anticipate being faster based on the behavior of modern memory architectures? Does working in a threaded environment change your response?

like image 551
Rhys Ulerich Avatar asked Jan 26 '12 15:01

Rhys Ulerich


2 Answers

Access pattern, that you name "writes stride one" (y[i]=x[q(i)]), is usually faster.

If memory is cached and your data pieces are smaller than cache line, this access pattern requires less memory bandwidth.

It is usual for modern processors to have more load execution units, than store units. And next Intel architecture, named Haswell, supports only GATHER instruction, while SCATTER is not yet in their plans. All this is also in favor of "writes stride one" pattern.

Working in a threaded environment does not change this.

like image 168
Evgeny Kluev Avatar answered Nov 12 '22 00:11

Evgeny Kluev


I'd like to share results of my simple benchmarks. Suppose we have two square NxN matrices A and B of doubles, and we want to perform a copy with a transposition:

A = transpose(B)

Algorithms:

  1. Two nested loops such that reads are contiguous and writes are strided.
  2. Two nested loops such that reads are strided and writes are contiguous.
  3. Sequential MKL's mkl_domatcopy.

Copy without transposition is used as a baseline. Values of N are taken to be 2^K + 1 to mitigate cache associativity effects.

Intel Core i7-4770 with GCC 8.3.0 (-O3 -m64 -march=native) and Intel MKL 2019.0.1:

Intel Core i7-4770

Intel Xeon E5-2650 v3 with GCC 7.3.0 (-O3 -m64 -march=native) and Intel MKL 2017.0.1:

Intel Xeon E5-2650 v3

Numbers and C++ source code

like image 44
Evg Avatar answered Nov 12 '22 01:11

Evg