I have an algorithm that does some computations on the elements of an array, and I'd like to reuse the input buffer by writing the results back into it. In terms of the data traversal pattern, it looks almost exactly like this (the only other things happening in that for-loop are increments to some pointers and counting variables):
int *inputData = /* input data is here */;
for (int i = 0; i < some_value; ++i)
{
    int result = do_some_computations(*inputData);
    *inputData = result;
    ++inputData;
}
Now the interesting part: inputData contains about six million elements. If I comment out the write to the inputData array, so the algorithm looks basically like this:
int *inputData = /* input data is here */;
for (int i = 0; i < some_value; ++i)
{
    int result = do_some_computations(*inputData);
    // *inputData = result;
    ++inputData;
}
then the algorithm, averaged over ~100 measurements, takes about 7 milliseconds. With the write left in, it takes about 55 milliseconds. Writing "*inputData = do_some_computations(*inputData);" instead makes no difference, and neither does writing to a separate outputBuffer.
This is bad. The performance of this algorithm is absolutely critical to the requirements of the program. I was very happy with 7ms, however I am very unhappy with 55 ms.
Why does this single write-back cause such a large overhead, and how can I fix it?
Your code is being optimised to almost nothing in the non-write-back version: without the store, the compiler can see that the loop's result is never used and eliminate most of the work. To show this, assume a 5 GHz single-core CPU:

7 ms = 35,000,000 cycles
35,000,000 cycles / 6,000,000 items ≈ 5.8 cycles per item = not a lot of work being done

For the slow version:

55 ms = 275,000,000 cycles
275,000,000 cycles / 6,000,000 items ≈ 45.8 cycles per item = far more work per item
If you want to verify this, look at the assembly output from the compiler (for example, compile with gcc/clang's -S flag, or paste the loop into Compiler Explorer) and check whether the call to do_some_computations survives in the version without the store.