Why is LOOP so slow? [duplicate]

Tags:

assembly

This astonished me, because I always thought that loop would have some internal optimization.

Here are the experiments I did today. I was using Microsoft Visual Studio 2010. My operating system is 64-bit Windows 8. My questions are at the end.

First experiment:

Platform: Win32
Mode: Debug (to disable optimization)

begin = clock();
_asm
{
    mov ecx, 07fffffffh
start:
    loop start
}
end = clock();
cout<<"passed time: "<<double(end - begin)/CLOCKS_PER_SEC<<endl;

Output: passed time: 3.583
(The number varies slightly from run to run, but stays at about the same magnitude.)
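For readers without MSVC inline assembly, the measurement pattern can be sketched portably. This is a minimal sketch, not the code that produced the numbers in this post: `time_seconds` and `countdown` are hypothetical names, `std::chrono::steady_clock` stands in for `clock()`, and it needs C++11 `<chrono>`, which may not be available in a compiler as old as the one used here.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical helper (not from the original post): runs `body` once and
// returns the elapsed wall-clock time in seconds.
template <typename F>
double time_seconds(F&& body) {
    const auto begin = std::chrono::steady_clock::now();
    body();
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(end - begin).count();
}

// Portable stand-in for the countdown loop. `volatile` keeps the compiler
// from deleting the empty loop; it also forces memory traffic, so this does
// NOT time the same instructions as the inline assembly.
inline std::uint32_t countdown(std::uint32_t iterations) {
    volatile std::uint32_t counter = iterations;
    while (counter != 0) {
        counter = counter - 1;
    }
    return counter;  // always 0 once the loop has finished
}
```

A call such as `time_seconds([] { countdown(0x7fffffffu); })` mirrors the structure of the experiments, but the absolute numbers are not comparable with the ones reported here.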

Second experiment:

Platform: Win32
Mode: Debug

begin = clock();
_asm
{
    mov ecx, 07fffffffh
start:
    dec ecx
    jnz start
}
end = clock();
cout<<"passed time: "<<double(end - begin)/CLOCKS_PER_SEC<<endl;

Output: passed time: 0.903
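To put the first two outputs side by side, the totals can be converted to an average time per iteration. A small sketch of that arithmetic, using only the iteration count and the two timings reported above; the function names are hypothetical:

```cpp
#include <cstdint>

// Iteration count used in every experiment: mov ecx, 07fffffffh.
constexpr std::uint64_t kIterations = 0x7fffffffULL;

// Average time per iteration, in nanoseconds, for a measured total in seconds.
constexpr double ns_per_iteration(double total_seconds) {
    return total_seconds * 1e9 / static_cast<double>(kIterations);
}

// How many times slower one measurement is than another.
constexpr double slowdown(double slower_seconds, double faster_seconds) {
    return slower_seconds / faster_seconds;
}
```

With the reported values, `ns_per_iteration(3.583)` is about 1.67 ns for loop, `ns_per_iteration(0.903)` is about 0.42 ns for dec/jnz, and `slowdown(3.583, 0.903)` is about 3.97. The clock frequency of the machine is not given, so these cannot be turned into cycle counts.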

Third and fourth experiment:

Just change the platform to x64. Since VC++ does not support 64-bit inline assembly, I had to put the loop in a separate *.asm file. In the end, the results were the same.

At this point I started to think: loop is about 4 times slower than dec ecx / jnz start, and the only difference between them, AFAIK, is that dec ecx changes the flags while loop doesn't. To imitate this preservation of flags, I did the

Fifth experiment:

Platform: Win32 (from here on I assume that the platform has no effect on the result)
Mode: Debug

begin = clock();
_asm
{
    mov ecx, 07fffffffh
    pushf
start:
    popf
; do the loop here
    pushf
    dec ecx
    jnz start
    popf
}
end = clock();
cout<<"passed time: "<<double(end - begin)/CLOCKS_PER_SEC<<endl;

Output: passed time: 22.134

This is understandable, because pushf and popf have to access memory. But suppose, for example, that the register eax does not need to be preserved across the loop (which can be achieved by arranging the registers better), and that the OF flag is not needed in the loop (this simplifies things, since OF is not in the lower 8 bits of the flags register). Then we may use lahf and sahf to preserve the flags, so I did the

Sixth experiment:

Platform: Win32
Mode: Debug

begin = clock();
_asm
{
    mov ecx, 07fffffffh
    lahf
start:
    sahf
; do the loop here
    lahf
    dec ecx
    jnz start
    sahf
}
end = clock();
cout<<"passed time: "<<double(end - begin)/CLOCKS_PER_SEC<<endl;

Output: passed time: 1.933

This is still much better than using loop directly, right?
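Why lahf/sahf cannot cover OF can be read off the flag layout. The sketch below uses the architectural bit positions of the status flags in EFLAGS; the constant and function names are hypothetical and only there for illustration.

```cpp
#include <cstdint>

// Bit positions of the status flags in EFLAGS.
constexpr std::uint32_t kCF = 1u << 0;   // carry
constexpr std::uint32_t kPF = 1u << 2;   // parity
constexpr std::uint32_t kAF = 1u << 4;   // auxiliary carry
constexpr std::uint32_t kZF = 1u << 6;   // zero
constexpr std::uint32_t kSF = 1u << 7;   // sign
constexpr std::uint32_t kOF = 1u << 11;  // overflow

// Flags that lahf copies into AH and sahf writes back: those in the low byte.
constexpr std::uint32_t kLahfMask = kCF | kPF | kAF | kZF | kSF;  // 0xD5

// Flags written by dec: the five status flags other than CF.
constexpr std::uint32_t kDecMask = kPF | kAF | kZF | kSF | kOF;

// True if a lahf/sahf pair is enough to preserve every flag in `needed`.
constexpr bool lahf_preserves(std::uint32_t needed) {
    return (needed & ~kLahfMask) == 0;
}
```

Here `lahf_preserves(kDecMask)` is false while `lahf_preserves(kDecMask & ~kOF)` is true: OF at bit 11 is the only flag touched by dec that falls outside the byte lahf and sahf move.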

The last experiment I did tries to preserve the OF flag as well.

Seventh experiment:

Platform: Win32
Mode: Debug

begin = clock();
_asm
{
    mov ecx, 07fffffffh
start:
    inc al
    sahf
; do the loop here
    lahf
    mov al, 0FFh
    jo dec_ecx
    mov al, 0
dec_ecx:
    dec ecx
    jnz start
}
end = clock();
cout<<"passed time: "<<double(end - begin)/CLOCKS_PER_SEC<<endl;

Output: passed time: 3.612

This result is for the worst case, in which OF is clear on every iteration (so the jo is never taken and mov al, 0 always executes). It is almost the same as using loop directly ...
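The seventh experiment relies on inc al to recreate OF from a value saved in al. A minimal model of that step, assuming the usual two's-complement rule for inc on a byte register (OF is set when the signed result does not fit in 8 bits); `inc_sets_of` is a hypothetical name:

```cpp
#include <cstdint>

// Models the OF result of `inc al` for a given value of al before the inc.
constexpr bool inc_sets_of(std::uint8_t al) {
    // Interpret al as a signed byte without relying on narrowing casts.
    const int signed_before = (al < 0x80) ? al : al - 0x100;  // -128..127
    const int exact = signed_before + 1;                      // no wraparound in int
    return exact > 127;  // result does not fit in a signed byte
}
```

Under this model only `inc_sets_of(0x7F)` is true. `inc_sets_of(0xFF)` is false, because 0FFh is -1 as a signed byte and -1 + 1 does not overflow, so a saved value of 07Fh rather than 0FFh would be needed for inc al to set OF again. The instructions executed per iteration are the same either way.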

So my questions are:

  1. Am I right that the ONLY advantage of using loop is that it preserves the flags (actually only the 5 of them that dec affects)?

  2. Is there a longer form of lahf and sahf which also moves OF, so that we may totally get rid of loop?

WhatsUp asked Feb 04 '14 22:02


1 Answer

Historically, on the 8088 and 8086 processors, LOOP was an optimization because it only took one cycle longer than a conditional branch, whereas putting a DEC CX before a branch would cost three or four cycles (depending upon the state of the prefetch queue).

Today's processors, however, work very differently from the 8086. For several processor generations, manufacturers have built machines that correctly execute essentially all of the documented instructions the 8088/8086 and its descendants have ever had, but they have focused their energies on improving the performance of only the most useful instructions. For a variety of reasons, the amount of circuitry Intel or AMD would have to add to a modern CPU to make the LOOP instruction work as efficiently as DEC CX/JNZ would likely exceed the total amount of circuitry in the entire 8086, probably by a huge margin.

Rather than increase the complexity of their high-performance CPU, the manufacturers include a much simpler, but slower, processing unit which can handle "obscure" instructions. A high-performance CPU needs lots of circuitry to let the execution of multiple instructions overlap, except when later instructions need results from earlier computations (and must wait until they're available). An "obscure instructions processing unit" can avoid the need for such circuitry by simply executing instructions one at a time.

supercat answered Oct 16 '22 19:10