A static analyzer we use has a report that says:
Subprograms with local subprograms (OPTI7)
This section lists subprograms that themselves have local subprograms. Especially when these subprograms share local variables, it can have a negative effect on performance.
This guide says:
Do not use nested routines Nested routines (routines within other routines; also known as "local procedures") require some special stack manipulation so that the variables of the outer routine can be seen by the inner routine. This results in a good bit of overhead. Instead of nesting, move the procedure to the unit scoping level and pass the necessary variables - if necessary by reference (use the var keyword) - or make the variable global at the unit scope.
We were interested in knowing if we should take this report into consideration when validating our code. The answers to this question suggest that one should profile one's application to see if there is any performance difference, but not much is said about the difference between nested routines and normal subroutines.
What is the actual difference between nested routines and normal routines and how may it cause a performance penalty?
tl;dr
push
/pop
s for nested subroutinesI wrote a little test to determine this, where GetRTClock
is measuring the current time with a precision of 1ns:
function subprogram_main(z : Integer) : Int64;
var
n : Integer;
s : Int64;
function subprogram_aux(n, z : Integer) : Integer;
var
i : Integer;
begin
// Do some useless work on the aux program
for i := 0 to n - 1 do begin
if (i > z) then
z := z + i
else
z := z - i;
end;
Result := z;
end;
begin
s := GetRTClock;
// Do some minor work on the main program
n := z div 100 * 100 + 100;
// Call the aux program
z := subprogram_aux(n, z);
Result := GetRTClock - s;
end;
function normal_aux(n, z : Integer) : Integer;
var
i : Integer;
begin
// Do some useless work on the aux program
for i := 0 to n - 1 do begin
if (i > z) then
z := z + i
else
z := z - i;
end;
Result := z;
end;
function normal_main(z : Integer) : Int64;
var
n : Integer;
s : Int64;
begin
s := GetRTClock;
// Do some minor work on the main program
n := z div 100 * 100 + 100;
// Call the aux program
z := normal_aux(n, z);
Result := GetRTClock - s;
end;
This compiles to:
subprogram_main
MyFormU.pas.41: begin
005CE7D0 55 push ebp
005CE7D1 8BEC mov ebp,esp
005CE7D3 83C4E0 add esp,-$20
005CE7D6 8945FC mov [ebp-$04],eax
MyFormU.pas.42: s := GetRTClock;
...
MyFormU.pas.45: n := z div 100 * 100 + 100;
...
MyFormU.pas.47: z := subprogram_aux(n, z);
005CE7F8 55 push ebp
005CE7F9 8B55FC mov edx,[ebp-$04]
005CE7FC 8B45EC mov eax,[ebp-$14]
005CE7FF E880FFFFFF call subprogram_aux
005CE804 59 pop ecx
005CE805 8945FC mov [ebp-$04],eax
MyFormU.pas.49: Result := GetRTClock - s;
...
normal_main
MyFormU.pas.70: begin
005CE870 55 push ebp
005CE871 8BEC mov ebp,esp
005CE873 83C4E0 add esp,-$20
005CE876 8945FC mov [ebp-$04],eax
MyFormU.pas.71: s := GetRTClock;
...
MyFormU.pas.74: n := z div 100 * 100 + 100;
...
MyFormU.pas.76: z := normal_aux(n, z);
005CE898 8B55FC mov edx,[ebp-$04]
005CE89B 8B45EC mov eax,[ebp-$14]
005CE89E E881FFFFFF call normal_aux
005CE8A3 8945FC mov [ebp-$04],eax
MyFormU.pas.78: Result := GetRTClock - s;
...
subprogram_aux:
MyFormU.pas.31: begin
005CE784 55 push ebp
005CE785 8BEC mov ebp,esp
005CE787 83C4EC add esp,-$14
005CE78A 8955F8 mov [ebp-$08],edx
005CE78D 8945FC mov [ebp-$04],eax
MyFormU.pas.33: for i := 0 to n - 1 do begin
005CE790 8B45FC mov eax,[ebp-$04]
005CE793 48 dec eax
005CE794 85C0 test eax,eax
005CE796 7C29 jl $005ce7c1
005CE798 40 inc eax
005CE799 8945EC mov [ebp-$14],eax
005CE79C C745F000000000 mov [ebp-$10],$00000000
MyFormU.pas.34: if (i > z) then
005CE7A3 8B45F0 mov eax,[ebp-$10]
005CE7A6 3B45F8 cmp eax,[ebp-$08]
005CE7A9 7E08 jle $005ce7b3
MyFormU.pas.35: z := z + i
005CE7AB 8B45F0 mov eax,[ebp-$10]
005CE7AE 0145F8 add [ebp-$08],eax
005CE7B1 EB06 jmp $005ce7b9
MyFormU.pas.37: z := z - i;
005CE7B3 8B45F0 mov eax,[ebp-$10]
005CE7B6 2945F8 sub [ebp-$08],eax
normal_aux:
MyFormU.pas.55: begin
005CE824 55 push ebp
005CE825 8BEC mov ebp,esp
005CE827 83C4EC add esp,-$14
005CE82A 8955F8 mov [ebp-$08],edx
005CE82D 8945FC mov [ebp-$04],eax
MyFormU.pas.57: for i := 0 to n - 1 do begin
005CE830 8B45FC mov eax,[ebp-$04]
005CE833 48 dec eax
005CE834 85C0 test eax,eax
005CE836 7C29 jl $005ce861
005CE838 40 inc eax
005CE839 8945EC mov [ebp-$14],eax
005CE83C C745F000000000 mov [ebp-$10],$00000000
MyFormU.pas.58: if (i > z) then
005CE843 8B45F0 mov eax,[ebp-$10]
005CE846 3B45F8 cmp eax,[ebp-$08]
005CE849 7E08 jle $005ce853
MyFormU.pas.59: z := z + i
005CE84B 8B45F0 mov eax,[ebp-$10]
005CE84E 0145F8 add [ebp-$08],eax
005CE851 EB06 jmp $005ce859
MyFormU.pas.61: z := z - i;
005CE853 8B45F0 mov eax,[ebp-$10]
005CE856 2945F8 sub [ebp-$08],eax
The only difference is one push
and one pop
. What happens if we turn on optimizations?
MyFormU.pas.47: z := subprogram_aux(n, z);
005CE7C5 8BD3 mov edx,ebx
005CE7C7 8BC6 mov eax,esi
005CE7C9 E8B6FFFFFF call subprogram_aux
MyFormU.pas.76: z := normal_aux(n, z);
005CE82D 8BD3 mov edx,ebx
005CE82F 8BC6 mov eax,esi
005CE831 E8B6FFFFFF call normal_aux
Both compile exactly to the same thing.
What happens when inlining?
MyFormU.pas.76: z := normal_aux(n, z);
005CE804 8BD3 mov edx,ebx
005CE806 8BC8 mov ecx,eax
005CE808 49 dec ecx
005CE809 85C9 test ecx,ecx
005CE80B 7C11 jl $005ce81e
005CE80D 41 inc ecx
005CE80E 33C0 xor eax,eax
005CE810 3BD0 cmp edx,eax
005CE812 7D04 jnl $005ce818
005CE814 03D0 add edx,eax
005CE816 EB02 jmp $005ce81a
005CE818 2BD0 sub edx,eax
005CE81A 40 inc eax
005CE81B 49 dec ecx
005CE81C 75F2 jnz $005ce810
subprogram_main:
MyFormU.pas.47: z := subprogram_aux(n, z);
005CE7A8 8BD3 mov edx,ebx
005CE7AA 8BC8 mov ecx,eax
005CE7AC 49 dec ecx
005CE7AD 85C9 test ecx,ecx
005CE7AF 7C11 jl $005ce7c2
005CE7B1 41 inc ecx
005CE7B2 33C0 xor eax,eax
005CE7B4 3BD0 cmp edx,eax
005CE7B6 7D04 jnl $005ce7bc
005CE7B8 03D0 add edx,eax
005CE7BA EB02 jmp $005ce7be
005CE7BC 2BD0 sub edx,eax
005CE7BE 40 inc eax
005CE7BF 49 dec ecx
005CE7C0 75F2 jnz $005ce7b4
Again, no difference.
I also profiled this little example, taking an average of 30 executions for each (normal and subprogram), called in random order:
constructor TForm1.Create(AOwner: TComponent);
const
c_nSamples = 60;
rnd_sample : array[0..c_nSamples - 1] of byte = (1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0);
var
subprogram_gt_ns : Int64;
normal_gt_ns : Int64;
rnd_input : Integer;
i : Integer;
begin
inherited Create(AOwner);
normal_gt_ns := 0;
subprogram_gt_ns := 0;
rnd_input := Random(1000);
for i := 0 to c_nSamples - 1 do
if (rnd_sample[i] = 1) then
Inc(subprogram_gt_ns, subprogram_main(rnd_input))
else
Inc(normal_gt_ns, normal_main(rnd_input));
OutputDebugString(PChar(' Normal ' + FloatToStr(normal_gt_ns / 30) + ' Subprogram ' + FloatToStr(subprogram_gt_ns / 30)));
end;
There is no significant difference even with optimizations turned off:
Debug Output: Normal 1166,66666666667 Subprogram 1203,33333333333 Process MyProject.exe (1824)
Finally, both texts that warn about performance mention something about shared local variables.
If we do not pass z
to subprogram_aux
, instead access it directly, we get:
MyFormU.pas.47: z := subprogram_aux(n);
005CE7D2 55 push ebp
005CE7D3 8BC3 mov eax,ebx
005CE7D5 E8AAFFFFFF call subprogram_aux
005CE7DA 59 pop ecx
005CE7DB 8945FC mov [ebp-$04],eax
Even with optimizations turned on.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With