Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

difference in CPU time for two similar lines

There is a while loop in my program, where IterZNext, IterZ are pointers to nodes in a list. The nodes in the list are of type struct with a field called "Index".

double xx = 20.0;
double yy = 10000.0;
double zz;      
while (IterZNext!=NULL && NextIndex<=NewIndex)
{
    IterZ=IterZNext;
    IterZNext = IterZ->Next;
    if (IterZNext!=NULL)
    {
        zz = xx + yy;
                NextIndex1 = IterZNext->Index; // line (*)
        NextIndex = IterZNext->Index;  // line (**)
        IterZNext->Index;
    }
}

When I profiled my program, I found the line (*)

NextIndex1 = IterZNext->Index;

consumes most of CPU time (2.193s), while the line (**)

NextIndex = IterZNext->Index;

which is all most the same with the line (*) only uses 0.093s. I used the Intel VTune Amplifier to see the assembly of these two lines, which is as follows:

Address Line    Assembly                   CPU Time Instructions Retired
Line (*):
0x1666  561 mov eax, dword ptr [ebp-0x44]   0.015s  50,000,000
0x1669  561 mov ecx, dword ptr [eax+0x8]        
0x166c  561 mov dword ptr [ebp-0x68], ecx   2.178s  1,614,000,000

Line (**):
0x166f  562 mov byte ptr [ebp-0x155], 0x1   0.039s  80,000,000
0x1676  562 mov eax, dword ptr [ebp-0x44]   0.027s  44,000,000
0x1679  562 mov ecx, dword ptr [eax+0x8]        
0x167c  562 mov dword ptr [ebp-0x5c], ecx   0.026s  94,000,000

If I change the order of the line () and the line (*), then the program changes to

double xx = 20.0;
double yy = 10000.0;
double zz;      
while (IterZNext!=NULL && NextIndex<=NewIndex)
{
    IterZ=IterZNext;
    IterZNext = IterZ->Next;
    if (IterZNext!=NULL)
    {
        zz = xx + yy;
                NextIndex = IterZNext->Index;  // line (**)
                NextIndex1 = IterZNext->Index; // line (*)
        IterZNext->Index;
    }
}

and the result for assembly changes to

Address Line    Assembly    CPU Time    Instructions Retired
Line (**):
0x1666  560 mov byte ptr [ebp-0x155], 0x1   0.044s  84,000,000
0x166d  560 mov eax, dword ptr [ebp-0x44]   0.006s  2,000,000
0x1670  560 mov ecx, dword ptr [eax+0x8]    0.001s  4,000,000
0x1673  560 mov dword ptr [ebp-0x5c], ecx   1.193s  1,536,000,000

Line (*):
0x1676  561 mov eax, dword ptr [ebp-0x44]   0.052s  128,000,000
0x1679  561 mov ecx, dword ptr [eax+0x8]        
0x167c  561 mov dword ptr [ebp-0x68], ecx   0.034s  112,000,000

In this case, line (*) uses most of CPU time (1.245s) while line () only uses 0.086s.

Could someone tell me: (1) Why does it take so long to make the first assignment? Notice that the line zz=xx+yy only uses 0.058s. Is this related to the cache misses? since all nodes in the list are dynamically genereated. (2) Why is there huge difference in CPU time between this two lines?

Thanks!

like image 793
utrbc Avatar asked Jul 07 '11 19:07

utrbc


2 Answers

All modern CPUs are superscaler & out-of-order - which means that instructions are not actually executed in the order of the assembly, and there isn't really such thing as the current PC - there are many 10s of instructions in flight and executing at once.

Therefore any sampling information a CPU reports is just a rough area the CPU was executing - it was executing the instruction it indicated when the sampling interrupt went off; but it was also executing all the other in-flight ones!

However people have got used to (and expect) profiling tools to tell them exactly which individual instruction the CPU is currently running - so when the sampling interrupt triggers the CPU essentially picks one of the many active instructions to be the 'current' one.

like image 140
DaveR Avatar answered Sep 19 '22 00:09

DaveR


CPU line caching is probably the reason. accessing [ebp-0x5c] also brings into the cache [ebp-0x68], which then will be fetched much faster (for the second case, vice versa for the first).

like image 41
littleadv Avatar answered Sep 17 '22 00:09

littleadv