This astonished me, because I always thought that loop
should have some inside optimization.
Here are the experiments I did today. I was using Microsoft Visual Studio 2010. My operation system is 64 bit Windows 8. My questions are at the end.
First experiment:
Platform: Win32
Mode: Debug (to disable optimization)
begin = clock();
_asm
{
mov ecx, 07fffffffh
start:
loop start
}
end = clock();
cout<<"passed time: "<<double(end - begin)/CLOCKS_PER_SEC<<endl;
Output: passed time: 3.583
(The number changes a little with each run, but it's morally the same size.)
Second experiment:
Platform: Win32
Mode: Debug
begin = clock();
_asm
{
mov ecx, 07fffffffh
start:
dec ecx
jnz start
}
end = clock();
cout<<"passed time: "<<double(end - begin)/CLOCKS_PER_SEC<<endl;
Output: passed time: 0.903
Third and fourth experiment:
Just change the platform to x64. Since VC++ does not support 64 bit inline assembly, I have to put the loop in another *.asm
file. But finally the results are the same.
And from this point I begin to use my brain - loop
is 4 times slower than dec ecx, jnz start
, and the only difference between them, AFAIK, is that dec ecx
changes flags while loop
doesn't. In order to imitate this keep of flags, I did the
Fifth experiment:
Platform: Win32 (in the following I always suppose that the platform has no effect on the result)
Mode: Debug
begin = clock();
_asm
{
mov ecx, 07fffffffh
pushf
start:
popf
; do the loop here
pushf
dec ecx
jnz start
popf
}
end = clock();
cout<<"passed time: "<<double(end - begin)/CLOCKS_PER_SEC<<endl;
Output: passed time: 22.134
This is understandable, because pushf
and popf
have to play with the memory. But, let's say, for example, that the register eax
is not to be kept at the end of the loop (which can be achieved by arranging the registers better), and that the flag OF
is not needed in the loop (this simplifies things since OF
is not in the lower 8 bits of flag
), then we may use lahf
and sahf
to keep the flags, so I did the
Sixth experiment:
Platform: Win32
Mode: Debug
begin = clock();
_asm
{
mov ecx, 07fffffffh
lahf
start:
sahf
; do the loop here
lahf
dec ecx
jnz start
sahf
}
end = clock();
cout<<"passed time: "<<double(end - begin)/CLOCKS_PER_SEC<<endl;
Output: passed time: 1.933
This is still much better than using loop
directly, right?
And the last experiment I did is to try to also keep the OF
flag.
Seventh experiment:
Platform: Win32
Mode: Debug
begin = clock();
_asm
{
mov ecx, 07fffffffh
start:
inc al
sahf
; do the loop here
lahf
mov al, 0FFh
jo dec_ecx
mov al, 0
dec_ecx:
dec ecx
jnz start
}
end = clock();
cout<<"passed time: "<<double(end - begin)/CLOCKS_PER_SEC<<endl;
Output: passed time: 3.612
This result is the worst case, i.e. OF
is not set at each loop. And it is almost the same as using loop
directly ...
So my questions are:
Am I right that, the ONLY advantage of using loop is that it takes care of the flags (actually only the 5 of them that dec
has effect on)?
Is there a longer form of lahf
and sahf
which also moves OF
, so that we may totally get rid of loop
?
Historically, on the 8088 and 8086 processors, LOOP
was an optimization because it only took one cycle longer than a conditional branch, whereas putting a DEC CX before a branch would cost three or four cycles (depending upon the state of the prefetch queue).
Today's processors, however, work very differently from the 8086. For a few processor generations, even though manufacturers have made machines that can process correctly essentially all of the documented instructions that the 8088/8086 or its descendants have ever possessed, they've focused their energies on enhancing only the performance of the most useful instructions. For a variety of reasons, the amount of circuitry Intel or AMD would have to add to a modern CPU to make the LOOP
instruction work as efficiently as DEC CX/JNZ
would likely exceed the total amount of circuitry in the entire 8086, probably by a huge margin. Rather than increase the complexity of their high-performance CPU, the manufacturers include a much simpler, but slower, processing unit which can handle "obscure" instructions. While a high-performance CPU will need lots of circuitry to allow the execution of multiple instructions to overlap except when later instructions need results from earlier computations (and must wait until they're available), an "obscure instructions processing unit" can avoid the need for such circuitry by simply executing instructions one at a time.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With