I have some big arrays given by MATLAB to C++ (therefore I need to take them as they are) that needs casting and permuting (row-mayor, column mayor issues).
The array, imgaux
is double type has size size_proj[0]*size_proj[1]*size_proj[2]
and needs to be casted to float, changing some locations of values. A minimal example is as follows:
#include <time.h>
#include <stdlib.h>
int main(void){
int size_proj[3];
size_proj[0] = 512;
size_proj[1] = 512;
size_proj[2] = 360;
size_t num_byte_double = size_proj[0] * size_proj[1] * size_proj[2] * sizeof(double);
size_t num_byte_float = size_proj[0] * size_proj[1] * size_proj[2] * sizeof(float);
double *imgaux = (double*) malloc(num_byte_double);
float* img = (float*) malloc(num_byte_float);
clock_t begin, end;
double time_spent;
begin = clock();
for (int k = 0; k < size_proj[0]; k++)
for (int i = 0; i <size_proj[1]; i++)
for (int j = 0; j < size_proj[2]; j++)
img[i + k*size_proj[1] + j*size_proj[0] * size_proj[1]] = (float)imgaux[k + i*size_proj[0] + j*size_proj[0] * size_proj[1]];
end = clock();
time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
printf("Time permuting and casting the input %f", (float)time_spent);
free(imgaux);
free(img);
getchar();
}
However, this is a huge performance bottleneck, taking up to 6 seconds for big arrays (512*512*300).
I know that if instead of doing the 3D indexing part, I'd be doing
for (int k = 0; k < size_proj[0]*size_proj[1]*size_proj[3]; k++)
img[k]=(float)imgaux[k];
The code would be taking about 0.2 seconds to run. However, I need the "permutation" of the dimensions as in the first code snippet.
Is there a way I can speed up that code while still changing the values of place?
Ok, let's unravel your loop a little bit by precalculating things ASAP:
int max0 = size_proj[0];
int max1 = size_proj[1];
int max2 = size_proj[2];
for (int k = 0; k < max0; k++)
{
int kOffset1 = k*max1;
int kOffset2 = k;
for (int i = 0; i < max1; i++)
{
int iOffset1 = i;
int iOffset2 = i*max0;
for (int j = 0; j < max2; j++)
{
int jOffset1 = j*max0*max1;
int jOffset2 = j*max0*max1;
int idx1 = iOffset1 + jOffset1 + kOffset1;
int idx2 = iOffset2 + jOffset2 + kOffset2;
img[idx1] = (float)imgaux[idx2];
}
}
}
The calculation for jOffset1/2
seems to be suboptimal being on the lowest level of your nested loop. This always makes the idx1/2
value jump for max0*max1
every iteration. So let's move this to the highest level:
int max0 = size_proj[0];
int max1 = size_proj[1];
int max2 = size_proj[2];
for (int j = 0; j < max2; j++)
{
int jOffset1 = j*max0*max1;
int jOffset2 = j*max0*max1;
for (int k = 0; k < max0; k++)
{
int kOffset1 = k*max1;
int kOffset2 = k;
for (int i = 0; i < max1; i++)
{
int iOffset1 = i;
int iOffset2 = i*max0;
int idx1 = iOffset1 + jOffset1 + kOffset1;
int idx2 = iOffset2 + jOffset2 + kOffset2;
img[idx1] = (float)imgaux[idx2];
}
}
}
That already looks better. kOffset1/2
and iOffset1/2
can't be optimized anymore, but we still have unecessary values and declarations. Let's sum these up:
for (int j = 0; j < size_proj[2]; j++)
{
int jOffset = j*size_proj[0]*size_proj[1];
for (int k = 0; k < size_proj[0]; k++)
{
int kOffset1 = k*size_proj[1];
for (int i = 0; i < size_proj[1]; i++)
{
int iOffset2 = i*size_proj[0];
img[i + jOffset + kOffset1] = (float)imgaux[iOffset2 + jOffset + k];
}
}
}
I tried your updated MVCE with your loop and with mine (same system using MSVC14):
Yours:
Time permuting and casting the input 4.180000
Mine:
Time permuting and casting the input 0.704000
Hopefully I didn't mess anything up ;-)
As @BarryTheHatchet pointed out and as it is easily overseen in the comment section: Instead of using an array of 3 int
values for size_proj
you better use three const int
values.
Not using an array will remove complexity from your code (using descriptive names of course)
The use of const
will prevent you from accidentially changing values in complex calculation and may allow the compiler for better optimization.
As @paddy pointed out: You may replace the multiplications at the different levels of your nested loop with calculations by precalculating the step sizes.
I had tried this but there wasn't any real change in the multiplication version and step version....
const int jStep = size_proj[0] * size_proj[1];
const int jStepMax = size_proj[0] * size_proj[1] * size_proj[2];
const int kStep1 = size_proj[1];
const int kStep1Max = size_proj[0] * size_proj[1];
const int kStep2 = 1;
const int kStep2Max = size_proj[0];
const int iStep1 = 1;
const int iStep1Max = size_proj[1];
const int iStep2 = size_proj[0];
const int iStep2Max = size_proj[0] * size_proj[1];
for (int jOffset = 0; jOffset < jStepMax; jOffset += jStep)
{
for (int kOffset1 = 0, kOffset2=0; kOffset1 < kStep1Max && kOffset2 < kStep2Max; kOffset1+=kStep1, kOffset2+=kStep2)
{
for (int iOffset1 = 0, iOffset2 = 0; iOffset1 < iStep1Max && iOffset2 < iStep2Max; iOffset1 += iStep1, iOffset2 += iStep2)
{
img[iOffset1 + jOffset + kOffset1] = (float)imgaux[iOffset2 + jOffset + kOffset2];
}
}
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With