I was doing a little speed testing in c++ (MSVS) and got a strange result. I was testing the speed of using one for loop vs multiple nested for loops. Here is the code:
double testX = 0;
// Single loop executes in roughly 0.04 seconds
for( int i = 0; i < 27000000; i++ ){
testX += 1;
}
// Nested loop executes in roughly 0.03 seconds
for( int x = 0; x < 300; x++ ){
for( int y = 0; y < 300; y++ ){
for( int z = 0; z < 300; z++ ){
testX += 1;
}
}
}
As you can see, the speed difference is fairly obvious. I have run this many times, and those are the average times I am seeing (these are timed using glfwGetTime()).
So my question is: why? Is my test method inadequate? Am I using too few loops? I have tried searching google, and the only similar question I could find related his problem to cache coherency, but since these are empty for loops, I didn't think it would really have an effect.
Any help is appriciated :)
Edit: Thanks to the comments, I realized that using empty for loops probably wasn't the best way of testing things... So I have updated my code to do some (very) simple operations to a double. I also am compiling in release mode. However, though the two methods are a lot more similar in times, the second method is still slightly faster.
And yes, this is all the test code (minus the timing/output functions, but those aren't exactly specific to the question).
Nested loops are not faster than single loops.
Microsoft BASIC had a nesting limit of 8. @Davislor: The error message refers to the stack size in the compiler, which is also a program, and which is recursively processing the nested-looped construct.
Nested loops are frequently (but not always) bad practice, because they're frequently (but not always) overkill for what you're trying to do. In many cases, there's a much faster and less wasteful way to accomplish the goal you're trying to achieve.
Most likely neither are measurably faster and they may get optimized to the same exact code depending on the compiler.
The compiler won't "optimize" the loops away when the testX variable is used somewhere later in the code. When I just add one line to the code to output testX, the results are as follows:
single for loop: 1.218 ms
nested for loop: 1.218 ms
This pretty much shows that the compiler converts the nested loop into a single loop, whenever possible. The loop index can be used to prevent that optimisation:
Modifying the code this way
for( int i = 0; i < 27000000; i++ ){
testX += i;
}
and
for( int x = 0; x < 300; x++ ){
testX += x;
for( int y = 0; y < 300; y++ ){
testX += y;
for( int z = 0; z < 300; z++ ){
testX += z;
}
}
}
will add a little overhead to the nested loop but the execution time shows
single for loop: 1.224 ms
nested for loop: 1.226 ms
The times given here are averaged over 30.000 loop runs.
Note: The testX += x;
only contributes 1 in 90000 and the testX += x;
only contributes 1 in 300. Thus the two sections above remain comparable.
Nested loops are not much slower than single loops but your observation that they are faster is not true.
And: The times you show are about 40 times the times I observed. I'd suggest to carefully inspect the compiler settings since I ran the test on a medium speed hardware. Maybe the results of glfwGetTime()
are questionable and this is the main reason for your question. Have you tried to use another timing scheme?
Edit: In order to prevent compiler optimisation the loop limit can be choosen to be non constant:
int lmt = rand() % 1 + 300; // random value 300 or 301
int big_lmt = lmt * lmt * lmt; // random value 27000000 or 27270901
for( int i = 0; i < big_lmt; i++ ){
testX += i;
}
for( int x = 0; x < lmt; x++ ){
testX += x;
for( int y = 0; y < lmt; y++ ){
testX += y;
for( int z = 0; z < lmt; z++ ){
testX += z;
}
}
}
This avoids compiler predictability.
Results (for a lmt = 300
case to be comparable):
single for loop: 1.213 ms
nested for loop: 1.216 ms
Result:
If you don't use your for
variables (x
,y
,z
) inside your for
loop, a smart compiler could (and should) convert your second form in a single for
loop without nesting. Unless you prevent such compiler optimization removing static predictability by having the user input the x
,y
,z
values at runtime from stdin
, or reading from some stream, etc..
Furthermore, if you don't do something with your testX
variable (say, printing it to stdout
), a smart compiler could (and should) optimize it away, i.e. remove the dead code altogether.
So what I'm saying is that the benchmark, as it is right now, is somehow ill-formed.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With