I have a program dealing with multiple nested loops, operating on a 3D domain:
unsigned int sX(m_sizeZ*m_sizeY);
unsigned int b(sX+m_sizeZ);
for(unsigned int i(1);i<m_sizeX-1;i++){
for(unsigned int j(1);j<m_sizeY-1;j++){
for(unsigned int k(1);k<m_sizeZ-1;k++){
m_r[b+k]=m_func[b+k]-m_cX*(m_data[b+k-sX]+m_data[b+k+sX]-2.0*m_data[b+k])
-m_cY*(m_data[b+k-m_sizeZ]+m_data[b+k+m_sizeZ]-2.0*m_data[b+k])
-m_cZ*(m_data[b+k-1]+m_data[b+k+1]-2.0*m_data[b+k]);
}
b+=m_sizeZ;
}
b+=2*m_sizeZ;
}
Where my arrays are double of size m_sizeX*m_sizeY*m_sizeZ.
I iterate this way because I don't want to touch the boundaries of the domain.
When compiling with (g++) -msse2 -ftree-vectorizer-verbose=2, I of course get the multiple nested loops note.
Is there any way I could use instead a single loop without (more or less) complicated check operations?
Thanks!
If your goal is good vectorization, it is probably best to just apply the same calculation to your edge points as to the inner points, only to reset them after you are done calculating all the points. Something like this:
unsigned int sX(m_sizeZ*m_sizeY);
unsigned int start = (1*m_sizeY + 1)*m_sizeZ + 1;
unsigned int end = ((m_sizeX - 1)*m_sizeY - 1)*m_sizeZ - 1;
//Do calculation for everything, including the edges.
for(unsigned int i = start; i < end; i++) {
m_r[i]=m_func[i]-m_cX*(m_data[i-sX]+m_data[i+sX]-2.0*m_data[i])
-m_cY*(m_data[i-m_sizeZ]+m_data[i+m_sizeZ]-2.0*m_data[i])
-m_cZ*(m_data[i-1]+m_data[i+1]-2.0*m_data[i]);
}
//Reset the edges.
for(unsigned x = 0; x < m_sizeX; x++) {
for(unsigned y = 0; y < m_sizeY; y++) {
m_r[x*sX + y*m_sizeZ] = m_data[x*sX + y*m_sizeZ];
m_r[x*sX + y*m_sizeZ + m_sizeZ-1] = m_data[x*sX + y*m_sizeZ + m_sizeZ-1];
}
}
for(unsigned x = 0; x < m_sizeX; x++) {
for(unsigned z = 0; z < m_sizeZ; z++) {
m_r[x*sX + z] = m_data[x*sX + z];
m_r[x*sX + (m_sizeY-1)*m_sizeZ + z] = m_data[x*sX + (m_sizeY-1)*m_sizeZ + z];
}
}
This is additional calculations that would be done, but it has two positive effects:
It is now very easy for your compiler to vectorize the first loop (which takes most of the time).
This approach virtually eliminates the edge problem that results from the fixed vector size: Since your vector unit handles several alligned(!) loop iterations in one, every edge in your calculations leads to two special iterations that need to be done. One at the start of a run to get the vector loop aligned, and one at the end to handle the leftovers of the vector loop.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With