improve locality and decrease Cache pollution in a medical image reconstruction implementation

Question

I'm doing a research for my University related to an Image reconstruction algorithm for medical usage.

I'm stuck in something up to 3 weeks, I need to improve the performance of the following code:

for (lor=lor0[mypid]; lor <= lor1[mypid]; lor++)
{
  LOR_X = P.symmLOR[lor].x;
  LOR_Y = P.symmLOR[lor].y;
  LOR_XY = P.symmLOR[lor].xy;
  lor_z = P.symmLOR[lor].z;
  LOR_Z_X = P.symmLOR[lor_z].x;
  LOR_Z_Y = P.symmLOR[lor_z].y;
  LOR_Z_XY = P.symmLOR[lor_z].xy;  

  s0 = P.a2r[lor];
  s1 = P.a2r[lor+1];

  for (s=s0; s < s1; s++)
  {
    pixel     = P.a2b[s];
    v         = P.a2p[s]; 

    b[lor]    += v * x[pixel];

    p          = P.symm_Xpixel[pixel];
    b[LOR_X]  += v * x[p];

    p          = P.symm_Ypixel[pixel];
    b[LOR_Y]  += v * x[p];

    p          = P.symm_XYpixel[pixel];
    b[LOR_XY] += v * x[p];


    // do Z symmetry.
    pixel_z    = P.symm_Zpixel[pixel];
    b[lor_z]  += v * x[pixel_z];


    p          = P.symm_Xpixel[pixel_z];
    b[LOR_Z_X]  += v * x[p];


    p          = P.symm_Ypixel[pixel_z];
    b[LOR_Z_Y]  += v * x[p];

    p          = P.symm_XYpixel[pixel_z];
    b[LOR_Z_XY] += v * x[p];

   }

}

for anyone who wants to know, The code implements the MLEM forward function and all the variables are FLOAT.

After several tests, I had notice that the big delay was on this part of the code. (you know, the 90 - 10 rule).

Later, I used Papi (http://cl.cs.utk.edu/papi/) to measure L1D cache misses. As I thought, Papi confirm that the performance decreases due to higher amount of misses, particularly for the random access to b vector (huge in size).

Reading information on the Internet I just know two options to improve performance so far: improve data locality or decrease data pollution.

To do the first improvement, I'll try to change the code to be cache aware, just like was propossed by Ulrich Drepper on What every programmer should know about the memory (www.akkadia.org/drepper/cpumemory.pdf) A.1 Matrix multiplication.

I believe that blocking the SpMV (sparse matrix-vector Multiplication) will improve the performance.

On the other hand, every time the program tried to access to b vector, we had what is known as cache pollution.

Is there's a way to do a load a value from the b vector with SIMD instruction without using the Cache?

Also, it is possible to use a function like void _mm_stream_ps(float * p , __m128 a ) to store ONE float value on the vector b without polluting the Cache?

I can't use _mm_stream_ps because always store 4 floats but the access to the b vector is clearly random.

I'd hope to be clear in my dilemma.

More info: v is the column value of an Sparse Matrix store with CRS format. I realize that other optimization could be done if I tried to change CRS format to other, however, like I said before, I'd made several test for months and I know that the performance decrease is related to random access on vector b. from 400.000.000 L1D Misses I can go to 100~ Misses when I don't store in vector b.

Thanks.

Jens Gustedt · Accepted Answer

I'd say, at first try to help your compiler a bit.

Declare the bounds for the outer loop as const before the loop.
Declare all variables that you may (all the LOR_..) as local variables, something like: float LOR_X = P.symmLOR[lor].x; or size_t s0 = P.a2r[lor];
This also in particular for loop variables if you happen to have modern, C99 compliant, compiler: for (size_t s=s0; s < s1; s++)
Separate load and store for your b vector. The locations of the items that you access, there, do not depend on s. So create a local variable to hold the actual value for all the distinct cases that you handle before the inner loop, update these local variables inside the inner loop, and store the results after the inner loop.
Perhaps separate your inner loop in several. The index computations are relatively cheap, and then your system might better recognize the streaming access to some of your vectors.
Look at the assembler that your compiler produces and identify the code for the inner loop. Experiment a bit to move as many of the "far" load and stores out of that loop.

Edit: after re-reading gravitron's answer and your comment, the important thing here is really to declare the variables as local as possible and to check the assembler that the compiler succeeds in having the cache-missing loads and stores outside the inner loop.

gravitron · Answer

A simple optimization to reduce the random access on vector b would be to never write into vector b within the inner for loop.

Instead load all values needed from vector B into temporary variables, do the entire inner for loop while updating these temporary variables, then write the temporary variables back into vector B.

The temporary variables will at worst be located on the same cache lines, depending on your compiler and environment you may also hint for the compiler to use registers for these variables.

improve locality and decrease Cache pollution in a medical image reconstruction implementation

Tags:

c

optimization

caching

blocking

simd

Fbarisio

2 Answers

Jens Gustedt

gravitron

Recent Activity

Donate For Us

improve locality and decrease Cache pollution in a medical image reconstruction implementation

Tags:

c

optimization

caching

blocking

simd

Fbarisio

2 Answers

Jens Gustedt

gravitron

Related questions

Recent Activity

Donate For Us