I'm working on a homework assignment where I must manually optimize a nested loop (my program will be compiled with optimizations disabled). The goal of the assignment is to run the entire program in less than 6 seconds (extra credit for less than 4.5 seconds).
I'm only allowed to change a small block of code, and the starting point is such:
for (j=0; j < ARRAY_SIZE; j++) {
sum += array[j];
}
Where ARRAY_SIZE
is 9973. This loop is contained within another loop that is run 200,000 times. This particular version runs in 16 seconds.
What I've done so far is change the implementation to unroll the loop and use pointers as my iterator:
(These declarations are not looped over 200,000 times)
register int unroll_length = 16;
register int *unroll_end = array + (ARRAY_SIZE - (ARRAY_SIZE % unroll_length));
register int *end = array + (ARRAY_SIZE -1);
register int *curr_end;
curr_end = end;
while (unroll_end != curr_end) {
sum += *curr_end;
curr_end--;
}
do {
sum += *curr_end + *(curr_end-1) + *(curr_end-2) + *(curr_end-3) +
*(curr_end-4) + *(curr_end-5) + *(curr_end-6) + *(curr_end-7) +
*(curr_end-8) + *(curr_end-9) + *(curr_end-10) + *(curr_end-11) +
*(curr_end-12) + *(curr_end-13) + *(curr_end-14) + *(curr_end-15);
}
while ((curr_end -= unroll_length) != array);
sum += *curr_end;
Using these techniques, I was able to get the execution down to 5.5 seconds, which will give me full credit. However; I sure do want to earn the extra credit, but I'm also curious what additional optimizations I can make that I might be overlooking?
Edit #1 (Adding outer loop)
srand(time(NULL));
for(j = 0; j < ARRAY_SIZE; j++) {
x = rand() / (int)(((unsigned)RAND_MAX + 1) / 14);
array[j] = x;
checksum += x;
}
for (i = 0; i < N_TIMES; i++) {
// inner loop goes here
if (sum != checksum)
printf("Checksum error!\n");
sum = 0;
}
To optimize nested loops it is necessary to consider both the inner loop and the outer loop performance, especially when the inner loop count is small for execution of each outer loop. In many typical DSP applications, loops comprise a majority of the number of cycles, or MIPS.
For the access to a an element you will have to do the index calculation yourself. But as one of the indices (j) is constant over the loop, you can compute the start element's index before the loop and inside just increment the index 1. That might get you some significant improvement.
you could try to store your variables in CPU register with :
register int *unroll_limit = array + (ARRAY_SIZE - (ARRAY_SIZE % 10));
register int *end = array + ARRAY_SIZE;
register int *curr;
and try with different size of manual loops to check when you maximize cache usage.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With