For an assignment at school, I'm performing an intensive operation on a very large array of numbers. While benchmarking a single-threaded version operating on the whole array and comparing my results to my classmate's, I noticed some strange behavior.
The function is as follows:
int compute (char a[], int start, int end) { int sum = 0; int min = a[start]; int max = a[start]; for (int i = start; i < end; i++) { if (a[i] > max) max = a[i]; if (a[i] < min) min = a[i]; int cube = a[i] * a[i] * a[i]; sum += cube; } return sum; }
But my classmate's program is consistently running faster, often much faster. His code is identical, except for the order of the instructions in the loop body:
for (int i = start; i < end; i++) { int cube = a[i] * a[i] * a[i]; sum += cube; if (a[i] > max) max = a[i]; if (a[i] < min) min = a[i]; }
Here's the output comparing the runtime of each version with an input array of size 1,000,000,000 (initialized with random signed bytes):
Min/max first: sum = 5445493143089, min = -128, max = 127 Completed in 1.050268 sec Product-sum first: sum = 5445493143089, min = -128, max = 127 Completed in 1.010639 sec
We have inspected the generated assembly for both versions and noticed the same instructions present, simply ordered differently. To my knowledge, this shouldn't have as significant an effect as it does, but I could be wrong. (We also noticed that the registers used differed greatly, but this I especially doubt should have an effect.)
We encounter this behavior when compiling for both C (-std=c11
) and C++ (-std=c++11
).
Why is the order of those lines heavily affecting the behavior of the sequential program? We are also benchmarking a parallel version of the operation, and in contrast, its behavior is almost unchanged. I looked into memory reordering as a possible culprit, but that doesn't appear to be the problem since the parallel version is virtually unaffected (and there's no overlap in the partitions anyway).
Intensive back-to-back tests demonstrating the behavior. Product-sum is always faster than min/max, even in alternation and allowing for caching.
Super pipelining improves the performance by decomposing the long latency stages (such as memory access stages) of a pipeline into several shorter stages, thereby possibly increasing the number of instructions running in parallel at each cycle.
Theory says that : "With pipelining, the CPU begins executing a second instruction before the first instruction is completed. Pipelining results in faster processing because the CPU does not have to wait for one instruction to complete the machine cycle."
With out-of-order execution (OoO for short), the processor would issue each of the instructions in program order, and then enter a new pipeline stage called "read operands" during which instructions whose operands are available would move to the execution stage, regardless of their order in the program.
If we put explicit jumps into the code, you can see that the one with the conditionals at the end can avoid one jump most of the time. This is similar to the code that will actually be generated by the compiler.
First form, min/max first:
int i = lo; goto start; loop: i++; start: if (!(i < hi)) goto end; if (!(a[i] > ret.max)) goto label1; ret.max = a[i]; label1: if (!(a[i] < ret.min)) goto label2; ret.min = a[i]; label2: long long square = a[i] * a[i]; ret.sum += square; goto loop; end:
Second form, min/max last:
int i = lo; goto start; loop: i++; start: if (!(i < hi)) goto end; long long square = a[i] * a[i]; ret.sum += square; if (!(a[i] > ret.max)) goto label1; ret.max = a[i]; label1: if (!(a[i] < ret.min)) goto loop; ret.min = a[i]; goto loop; end:
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With