I'm doing research for my university related to an image reconstruction algorithm for medical use.
I've been stuck on this for about 3 weeks now; I need to improve the performance of the following code:
for (lor = lor0[mypid]; lor <= lor1[mypid]; lor++)
{
    /* Cache the indices of the symmetric LORs for this line of response. */
    LOR_X    = P.symmLOR[lor].x;
    LOR_Y    = P.symmLOR[lor].y;
    LOR_XY   = P.symmLOR[lor].xy;
    lor_z    = P.symmLOR[lor].z;
    LOR_Z_X  = P.symmLOR[lor_z].x;
    LOR_Z_Y  = P.symmLOR[lor_z].y;
    LOR_Z_XY = P.symmLOR[lor_z].xy;

    /* CRS row extent: the nonzeros of row lor live in [s0, s1). */
    s0 = P.a2r[lor];
    s1 = P.a2r[lor + 1];
    for (s = s0; s < s1; s++)
    {
        pixel = P.a2b[s];   /* column index of the nonzero */
        v     = P.a2p[s];   /* nonzero value */
        b[lor] += v * x[pixel];

        p = P.symm_Xpixel[pixel];
        b[LOR_X] += v * x[p];
        p = P.symm_Ypixel[pixel];
        b[LOR_Y] += v * x[p];
        p = P.symm_XYpixel[pixel];
        b[LOR_XY] += v * x[p];

        // do Z symmetry.
        pixel_z = P.symm_Zpixel[pixel];
        b[lor_z] += v * x[pixel_z];
        p = P.symm_Xpixel[pixel_z];
        b[LOR_Z_X] += v * x[p];
        p = P.symm_Ypixel[pixel_z];
        b[LOR_Z_Y] += v * x[p];
        p = P.symm_XYpixel[pixel_z];
        b[LOR_Z_XY] += v * x[p];
    }
}
For anyone who wants to know: the code implements the MLEM forward-projection function, and all the variables are float.
After several tests, I noticed that the big delay was in this part of the code (you know, the 90/10 rule).
Later, I used PAPI (http://cl.cs.utk.edu/papi/) to measure L1D cache misses. As I suspected, PAPI confirmed that performance degrades because of the high number of misses, particularly from the random accesses to the b vector (which is huge).
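For reference, this is roughly how I counted the misses with PAPI's classic high-level API; forward_project() is just a hypothetical stand-in for the loop shown above:

#include <papi.h>
#include <stdio.h>

extern void forward_project(void);  /* hypothetical: the loop shown above */

void measure_l1d_misses(void)
{
    int events[1] = { PAPI_L1_DCM };    /* L1 data-cache miss event */
    long long counts[1];

    if (PAPI_start_counters(events, 1) != PAPI_OK)
        fprintf(stderr, "PAPI_start_counters failed\n");
    forward_project();
    if (PAPI_stop_counters(counts, 1) != PAPI_OK)
        fprintf(stderr, "PAPI_stop_counters failed\n");
    printf("L1D misses: %lld\n", counts[0]);
}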
From reading around on the Internet, I know of only two options to improve performance so far: improve data locality or reduce cache pollution.
For the first improvement, I'll try to change the code to be cache-aware, just as proposed by Ulrich Drepper in What Every Programmer Should Know About Memory (www.akkadia.org/drepper/cpumemory.pdf), section A.1, matrix multiplication.
I believe that blocking the SpMV (sparse matrix-vector multiplication) will improve the performance; a sketch of what I mean follows.
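A minimal sketch of column-blocked SpMV, assuming the matrix has been pre-split into column blocks whose slice of x fits in L1 (block_a2r, block_a2b, block_a2p and n_blocks are hypothetical names, not my actual structures):

for (blk = 0; blk < n_blocks; blk++)
{
    /* Within a block, all column indices fall in one L1-sized slice of x,
       so the x accesses stay cache-resident while the block is processed. */
    for (row = 0; row < n_rows; row++)
    {
        float sum = 0.0f;
        for (s = block_a2r[blk][row]; s < block_a2r[blk][row + 1]; s++)
            sum += block_a2p[blk][s] * x[block_a2b[blk][s]];
        b[row] += sum;   /* accumulate partial row sums across blocks */
    }
}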
On the other hand, every time the program accesses the b vector, we get what is known as cache pollution.
Is there a way to load a value from the b vector with a SIMD instruction without going through the cache?
Also, is it possible to use a function like void _mm_stream_ps(float *p, __m128 a) to store ONE float value into the b vector without polluting the cache?
I can't use _mm_stream_ps because it always stores 4 floats, and the accesses to the b vector are clearly random.
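One thing I did find: SSE2 has a single 32-bit non-temporal store, MOVNTI, exposed as _mm_stream_si32, and a float can be pushed through it with a bit-cast. A sketch, where stream_store_float is my own helper name:

#include <emmintrin.h>  /* SSE2: _mm_stream_si32 (MOVNTI) */
#include <string.h>

/* Store ONE float non-temporally (bypassing the cache) by copying its
   bit pattern into an int and issuing MOVNTI. */
static inline void stream_store_float(float *dst, float value)
{
    int bits;
    memcpy(&bits, &value, sizeof bits);  /* bit-cast without aliasing UB */
    _mm_stream_si32((int *)dst, bits);
}

But b[...] += v * x[p] is a read-modify-write, so the load still goes through the cache, and reloading a line right after streaming it out is slow, so this alone doesn't solve my problem.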
I hope I've made my dilemma clear.
More info: v is a nonzero value of a sparse matrix stored in CRS format. I realize that other optimizations could be done by changing from CRS to another format; however, as I said before, I've been testing for months, and I know the performance decrease is related to the random accesses on vector b. I go from 400,000,000 L1D misses down to ~100 misses when I don't store into vector b.
Thanks.
I'd say, at first, try to help your compiler a bit:

- Declare the LOR_.. values as local variables, const, before the loop; something like const float LOR_X = P.symmLOR[lor].x; or const size_t s0 = P.a2r[lor]; and then for (size_t s = s0; s < s1; s++) for the inner loop.
- Do the same for the b vector. The locations of the items that you access there do not depend on s. So create a local variable to hold the actual value for all the distinct cases that you handle before the inner loop, update these local variables inside the inner loop, and store the results after the inner loop.

Edit: after re-reading gravitron's answer and your comment, the important thing here is really to declare the variables as local as possible and to check the assembler to verify that the compiler succeeds in keeping the cache-missing loads and stores outside the inner loop.
A simple optimization to reduce the random accesses on vector b would be to never write into vector b within the inner for loop.
Instead, load all the values needed from vector b into temporary variables, run the entire inner for loop while updating these temporary variables, then write the temporary variables back into vector b.
The temporary variables will at worst be located on the same few cache lines; depending on your compiler and environment, you may also be able to hint to the compiler to keep these variables in registers.
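A minimal sketch of that restructuring, keeping your names (I'm assuming the index arrays hold plain integers, which the question doesn't show):

for (lor = lor0[mypid]; lor <= lor1[mypid]; lor++)
{
    const int LOR_X    = P.symmLOR[lor].x;
    const int LOR_Y    = P.symmLOR[lor].y;
    const int LOR_XY   = P.symmLOR[lor].xy;
    const int lor_z    = P.symmLOR[lor].z;
    const int LOR_Z_X  = P.symmLOR[lor_z].x;
    const int LOR_Z_Y  = P.symmLOR[lor_z].y;
    const int LOR_Z_XY = P.symmLOR[lor_z].xy;

    /* One local accumulator per distinct b[] slot touched by this LOR. */
    float acc = 0.0f, acc_x = 0.0f, acc_y = 0.0f, acc_xy = 0.0f;
    float acc_z = 0.0f, acc_zx = 0.0f, acc_zy = 0.0f, acc_zxy = 0.0f;

    const size_t s0 = P.a2r[lor];
    const size_t s1 = P.a2r[lor + 1];
    for (size_t s = s0; s < s1; s++)
    {
        const int   pixel   = P.a2b[s];
        const int   pixel_z = P.symm_Zpixel[pixel];
        const float v       = P.a2p[s];

        acc     += v * x[pixel];
        acc_x   += v * x[P.symm_Xpixel[pixel]];
        acc_y   += v * x[P.symm_Ypixel[pixel]];
        acc_xy  += v * x[P.symm_XYpixel[pixel]];
        acc_z   += v * x[pixel_z];
        acc_zx  += v * x[P.symm_Xpixel[pixel_z]];
        acc_zy  += v * x[P.symm_Ypixel[pixel_z]];
        acc_zxy += v * x[P.symm_XYpixel[pixel_z]];
    }

    /* Write back to b once per LOR instead of once per nonzero. */
    b[lor]      += acc;
    b[LOR_X]    += acc_x;
    b[LOR_Y]    += acc_y;
    b[LOR_XY]   += acc_xy;
    b[lor_z]    += acc_z;
    b[LOR_Z_X]  += acc_zx;
    b[LOR_Z_Y]  += acc_zy;
    b[LOR_Z_XY] += acc_zxy;
}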