Have a look at these loops (three versions of the same nested loop, differing only in their start index):
const int arrayLength = ...
Version 0
public void RunTestFrom0()
{
    int sum = 0;
    for (int i = 0; i < arrayLength; i++)
        for (int j = 0; j < arrayLength; j++)
            for (int k = 0; k < arrayLength; k++)
                for (int l = 0; l < arrayLength; l++)
                    for (int m = 0; m < arrayLength; m++)
                    {
                        sum += myArray[i][j][k][l][m];
                    }
}
Version 1
public void RunTestFrom1()
{
    int sum = 0;
    for (int i = 1; i < arrayLength; i++)
        for (int j = 1; j < arrayLength; j++)
            for (int k = 1; k < arrayLength; k++)
                for (int l = 1; l < arrayLength; l++)
                    for (int m = 1; m < arrayLength; m++)
                    {
                        sum += myArray[i][j][k][l][m];
                    }
}
Version 2
public void RunTestFrom2()
{
    int sum = 0;
    for (int i = 2; i < arrayLength; i++)
        for (int j = 2; j < arrayLength; j++)
            for (int k = 2; k < arrayLength; k++)
                for (int l = 2; l < arrayLength; l++)
                    for (int m = 2; m < arrayLength; m++)
                    {
                        sum += myArray[i][j][k][l][m];
                    }
}
Results for arrayLength = 50, averaged over multiple samples and compiled x64: [the timing figures from the original post are not reproduced here; Version 0 came out noticeably faster than Versions 1 and 2]. The same measurement with arrayLength = 45: [timing figures not reproduced here].
Why? Some notes: myArray is a jagged array whose every dimension has length arrayLength. I initialized it at the beginning, and that time is excluded from the measurements. Every element's value is 1, so sum gives the total number of loop iterations.
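For reference, here is a minimal sketch of how a five-level jagged array like myArray could be allocated and filled with 1s; the question does not show the actual initialization code, so treat this only as an illustration of the setup:

int[][][][][] myArray = new int[arrayLength][][][][];
for (int i = 0; i < arrayLength; i++)
{
    myArray[i] = new int[arrayLength][][][];
    for (int j = 0; j < arrayLength; j++)
    {
        myArray[i][j] = new int[arrayLength][][];
        for (int k = 0; k < arrayLength; k++)
        {
            myArray[i][j][k] = new int[arrayLength][];
            for (int l = 0; l < arrayLength; l++)
            {
                myArray[i][j][k][l] = new int[arrayLength];
                for (int m = 0; m < arrayLength; m++)
                    myArray[i][j][k][l][m] = 1; // every element is 1, so sum counts iterations
            }
        }
    }
}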
Now I discard myArray completely, use sum++ instead, and add GC.Collect() before each run:
public void RunTestConstStartConstEnd()
{
    int sum = 0;
    for (int i = constStart; i < constEnd; i++)
        for (int j = constStart; j < constEnd; j++)
            for (int k = constStart; k < constEnd; k++)
                for (int l = constStart; l < constEnd; l++)
                    for (int m = constStart; m < constEnd; m++)
                    {
                        sum++;
                    }
}
Update
This appears to me to be a result of an unsuccessful attempt at optimization by the jitter, not the compiler. In short, if the jitter can determine the lower bound is a constant it will do something different which turns out to actually be slower. The basis for my conclusions takes some proving, so bear with me. Or go read something else if you're not interested!
I concluded this after testing four different ways to set the lower bound of the loop:

1. Hard-coded in all loops
2. Supplied as a command-line argument
3. Hard-coded, but copied to a local variable
4. Hard-coded, but passed through the Confuser function shown below
The compiled intermediate language for all four versions of the looping section is almost identical. The only difference is that in version 1 the lower bound is loaded with the instruction ldc.i4.#, where # is 0, 1, 2, or 3. That stands for "load constant" (see the ldc.i4 opcode). In all other versions, the lower bound is loaded with ldloc. This is true even in case 3, where the compiler could have inferred that lowerBound is really a constant.
The resulting performance is not constant. Version 1 (explicit constant) is slower than version 2 (run-time argument), along similar lines as found by the OP. What is very interesting is that version 3 is also slower, with times comparable to version 1. So even though the IL treats the lower bound as a variable, the jitter appears to have figured out that the value never changes, and substitutes a constant as in version 1, with the corresponding performance reduction. In version 4 the jitter can't infer what I know -- that Confuser is actually an identity function -- so it leaves the variable as a variable. The resulting performance is the same as with the command-line-argument version (2).
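To make the IL difference concrete, here is an illustrative sketch (my own, not the decompiled output from these tests; exact local slots will differ, but these are the opcodes involved):

static int Illustrate(int upperBound)
{
    int sum = 0;
    int lowerBound = 2;   // a local assigned a constant, as in version 3

    // Version 1 style: the loop initializer compiles to ldc.i4.2 ("load constant 2").
    for (int m = 2; m < upperBound; m++)
        sum++;

    // Version 3 style: the initializer compiles to ldloc ("load local variable"),
    // even though lowerBound never changes.
    for (int m = lowerBound; m < upperBound; m++)
        sum++;

    return sum;
}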
My theory on the cause of the performance difference: the jitter is aware of, and makes use of, the fine details of the actual processor architecture. When it decides to use a constant other than 0, it has to actually go fetch that literal value from some storage which is not in the L2 cache. When it is fetching a frequently used local variable, it instead reads its value from the L2 cache, which is insanely fast. Normally it doesn't make sense to take up room in the precious cache with something as dumb as a known literal integer value. In this case we care more about read time than storage, though, so it has an undesired impact on performance.
Here is the full code for version 2 (command-line argument):
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

class Program {
    static void Main(string[] args) {
        List<double> testResults = new List<double>();
        Stopwatch sw = new Stopwatch();
        int upperBound = int.Parse(args[0]) + 1;
        int tests = int.Parse(args[1]);
        int lowerBound = int.Parse(args[2]); // THIS LINE CHANGES
        int sum = 0;
        for (int iTest = 0; iTest < tests; iTest++) {
            sum = 0;
            GC.Collect();
            sw.Reset();
            sw.Start();
            for (int lvl1 = lowerBound; lvl1 < upperBound; lvl1++)
                for (int lvl2 = lowerBound; lvl2 < upperBound; lvl2++)
                    for (int lvl3 = lowerBound; lvl3 < upperBound; lvl3++)
                        for (int lvl4 = lowerBound; lvl4 < upperBound; lvl4++)
                            for (int lvl5 = lowerBound; lvl5 < upperBound; lvl5++)
                                sum++;
            sw.Stop();
            testResults.Add(sw.Elapsed.TotalMilliseconds);
        }
        double avg = testResults.Average();
        double stdev = testResults.StdDev();
        string fmt = "{0,13} {1,13} {2,13} {3,13}";
        string bar = new string('-', 13);
        Console.WriteLine();
        Console.WriteLine(fmt, "Iterations", "Average (ms)", "Std Dev (ms)", "Per It. (ns)");
        Console.WriteLine(fmt, bar, bar, bar, bar);
        Console.WriteLine(fmt, sum, avg.ToString("F3"), stdev.ToString("F3"),
            ((avg * 1000000) / (double)sum).ToString("F3"));
    }
}
public static class Ext {
    public static double StdDev(this IEnumerable<double> vals) {
        double result = 0;
        int cnt = vals.Count();
        if (cnt > 1) {
            double avg = vals.Average();
            double sum = vals.Sum(d => Math.Pow(d - avg, 2));
            result = Math.Sqrt(sum / (cnt - 1));
        }
        return result;
    }
}
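As a usage sketch (the executable name and argument values here are just an illustration, not taken from the original answer), the lower-bound-2 run with 10 iterations could be launched as:

Looper.exe 50 10 2

i.e. loop from 2 up to and including 50, repeated 10 times, which gives 49^5 = 282,475,249 iterations per run and matches the Iterations column in the results below.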
For version 1: same as above, except remove the lowerBound declaration and replace all instances of lowerBound with the literal 0, 1, 2, or 3 (each compiled and executed separately).
For version 3: same as above, except replace the lowerBound declaration with:

    int lowerBound = 0; // or 1, 2, or 3

For version 4: same as above, except replace the lowerBound declaration with:

    int lowerBound = Ext.Confuser<int>(0); // or 1, 2, or 3
Where Confuser is:
public static T Confuser<T>(T d) {
    decimal d1 = (decimal)Convert.ChangeType(d, typeof(decimal));
    List<decimal> L = new List<decimal>() { d1, d1 };
    decimal d2 = L.Average();
    if (d1 - d2 < 0.1m) {
        return (T)Convert.ChangeType(d2, typeof(T));
    } else {
        // This will never actually happen :)
        return (T)Convert.ChangeType(0, typeof(T));
    }
}
Results (50 iterations of each test, in 5 batches of 10):
1: Lower bound hard-coded in all loops:
Program     Iterations  Average (ms)  Std Dev (ms)  Per It. (ns)
-------- ------------- ------------- ------------- -------------
Looper0      345025251       267.813         1.776         0.776
Looper1      312500000       344.596         0.597         1.103
Looper2      282475249       311.951         0.803         1.104
Looper3      254803968       282.710         2.042         1.109
2: Lower bound supplied at command line:
Program     Iterations  Average (ms)  Std Dev (ms)  Per It. (ns)
-------- ------------- ------------- ------------- -------------
Looper       345025251       269.317         0.853         0.781
Looper       312500000       244.946         1.434         0.784
Looper       282475249       222.029         0.919         0.786
Looper       254803968       201.238         1.158         0.790
3: Lower bound hard-coded but copied to local variable:
Program     Iterations  Average (ms)  Std Dev (ms)  Per It. (ns)
-------- ------------- ------------- ------------- -------------
LooperX0     345025251       267.496         1.055         0.775
LooperX1     312500000       345.614         1.633         1.106
LooperX2     282475249       311.868         0.441         1.104
LooperX3     254803968       281.983         0.681         1.107
4: Lower bound hard-coded but ground through Confuser:
Program     Iterations  Average (ms)  Std Dev (ms)  Per It. (ns)
-------- ------------- ------------- ------------- -------------
LooperZ0     345025251       266.203         0.489         0.772
LooperZ1     312500000       241.689         0.571         0.774
LooperZ2     282475249       219.533         1.205         0.777
LooperZ3     254803968       198.308         0.416         0.778
That is an enormous array. For all practical purposes you are testing how long it takes your operating system to fetch the values of each element from memory, not how long it takes to compare whether j, k, etc. are less than arrayLength, to increment the counters, and to increment your sum. The latency of fetching those values has little to do with the runtime or jitter per se and a lot to do with whatever else happens to be running on your system as a whole, and with the current compression and organization of the heap.
In addition, because your array is taking up so much room and being accessed frequently, it's quite possible that garbage collection is running during some of your test iterations, which would completely inflate the apparent CPU time.
Try doing your test without the array lookup -- just add 1 (sum++) and then see what happens. To be even more thorough, call GC.Collect() just before each test to avoid a collection during the loop.
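A minimal sketch of that suggested measurement pattern (my own illustration; assumes System and System.Diagnostics are in scope, and arrayLength stands for whatever bound is being tested):

GC.Collect();                       // finish any pending collection before timing
GC.WaitForPendingFinalizers();
GC.Collect();

Stopwatch sw = Stopwatch.StartNew();
int sum = 0;
for (int i = 0; i < arrayLength; i++)
    sum++;                          // no array lookup, just the increment
sw.Stop();
Console.WriteLine("{0} iterations in {1} ms", sum, sw.ElapsedMilliseconds);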
I think Version 0 is faster because the compiler generates special code without range checking in that case. See http://msdn.microsoft.com/library/ms973858.aspx (section "Range Check Elimination").
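As a rough sketch of the pattern that article describes (my own example, not code from the question): when the loop bound is visibly the array's own Length, the JIT can drop the per-access bounds check, whereas with other bound expressions the check may remain.

static int SumAll(int[] data)
{
    int sum = 0;
    // 0 <= i < data.Length: the JIT recognizes this shape and can eliminate
    // the bounds check inside data[i].
    for (int i = 0; i < data.Length; i++)
        sum += data[i];
    return sum;
}

static int SumFrom(int[] data, int start)
{
    int sum = 0;
    // With an arbitrary lower bound the JIT may not prove every access is in
    // range, so the per-element check can stay.
    for (int i = start; i < data.Length; i++)
        sum += data[i];
    return sum;
}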