Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why is there a performance penalty for nested subroutines in Delphi?

Tags:

delphi

A static analyzer we use has a report that says:

Subprograms with local subprograms (OPTI7)

This section lists subprograms that themselves have local subprograms. Especially when these subprograms share local variables, it can have a negative effect on performance.

This guide says:

Do not use nested routines Nested routines (routines within other routines; also known as "local procedures") require some special stack manipulation so that the variables of the outer routine can be seen by the inner routine. This results in a good bit of overhead. Instead of nesting, move the procedure to the unit scoping level and pass the necessary variables - if necessary by reference (use the var keyword) - or make the variable global at the unit scope.

We were interested in knowing if we should take this report into consideration when validating our code. The answers to this question suggest that one should profile one's application to see if there is any performance difference, but not much is said about the difference between nested routines and normal subroutines.

What is the actual difference between nested routines and normal routines and how may it cause a performance penalty?

like image 513
afarah Avatar asked Apr 24 '19 14:04

afarah


1 Answers

tl;dr

  • There are extra push/pops for nested subroutines
  • Turning on optimizations may strip those away, such that the generated code is the same for both nested subroutines and normal subroutines
  • Inlining results in the same code being generated for both nested and normal subroutines
  • For simple routines with few parameters and local variables we perceived no performance difference even with optimizations turned off

I wrote a little test to determine this, where GetRTClock is measuring the current time with a precision of 1ns:

function subprogram_main(z : Integer) : Int64;
var
  n : Integer;
  s : Int64;

  function subprogram_aux(n, z : Integer) : Integer;
  var
    i : Integer;
  begin
    // Do some useless work on the aux program
    for i := 0 to n - 1 do begin
      if (i > z) then
        z := z + i
      else
        z := z - i;
    end;
    Result := z;
  end;
begin
  s := GetRTClock;

  // Do some minor work on the main program
  n := z div 100 * 100 + 100;
  // Call the aux program
  z := subprogram_aux(n, z);

  Result := GetRTClock - s;
end;

function normal_aux(n, z : Integer) : Integer;
var
  i : Integer;
begin
    // Do some useless work on the aux program
  for i := 0 to n - 1 do begin
    if (i > z) then
      z := z + i
    else
      z := z - i;
  end;
  Result := z;
end;

function normal_main(z : Integer) : Int64;
var
  n : Integer;
  s : Int64;
begin
  s := GetRTClock;

  // Do some minor work on the main program
  n := z div 100 * 100 + 100;
  // Call the aux program
  z := normal_aux(n, z);

  Result := GetRTClock - s;
end;

This compiles to:

subprogram_main

MyFormU.pas.41: begin
005CE7D0 55               push ebp
005CE7D1 8BEC             mov ebp,esp
005CE7D3 83C4E0           add esp,-$20
005CE7D6 8945FC           mov [ebp-$04],eax
MyFormU.pas.42: s := GetRTClock;
...
MyFormU.pas.45: n := z div 100 * 100 + 100;
...
MyFormU.pas.47: z := subprogram_aux(n, z);
005CE7F8 55               push ebp
005CE7F9 8B55FC           mov edx,[ebp-$04]
005CE7FC 8B45EC           mov eax,[ebp-$14]
005CE7FF E880FFFFFF       call subprogram_aux
005CE804 59               pop ecx
005CE805 8945FC           mov [ebp-$04],eax
MyFormU.pas.49: Result := GetRTClock - s;
...

normal_main

MyFormU.pas.70: begin
005CE870 55               push ebp
005CE871 8BEC             mov ebp,esp
005CE873 83C4E0           add esp,-$20
005CE876 8945FC           mov [ebp-$04],eax
MyFormU.pas.71: s := GetRTClock;
...
MyFormU.pas.74: n := z div 100 * 100 + 100;
...
MyFormU.pas.76: z := normal_aux(n, z);
005CE898 8B55FC           mov edx,[ebp-$04]
005CE89B 8B45EC           mov eax,[ebp-$14]
005CE89E E881FFFFFF       call normal_aux
005CE8A3 8945FC           mov [ebp-$04],eax
MyFormU.pas.78: Result := GetRTClock - s;
...

subprogram_aux:

MyFormU.pas.31: begin
005CE784 55               push ebp
005CE785 8BEC             mov ebp,esp
005CE787 83C4EC           add esp,-$14
005CE78A 8955F8           mov [ebp-$08],edx
005CE78D 8945FC           mov [ebp-$04],eax
MyFormU.pas.33: for i := 0 to n - 1 do begin
005CE790 8B45FC           mov eax,[ebp-$04]
005CE793 48               dec eax
005CE794 85C0             test eax,eax
005CE796 7C29             jl $005ce7c1
005CE798 40               inc eax
005CE799 8945EC           mov [ebp-$14],eax
005CE79C C745F000000000   mov [ebp-$10],$00000000
MyFormU.pas.34: if (i > z) then
005CE7A3 8B45F0           mov eax,[ebp-$10]
005CE7A6 3B45F8           cmp eax,[ebp-$08]
005CE7A9 7E08             jle $005ce7b3
MyFormU.pas.35: z := z + i
005CE7AB 8B45F0           mov eax,[ebp-$10]
005CE7AE 0145F8           add [ebp-$08],eax
005CE7B1 EB06             jmp $005ce7b9
MyFormU.pas.37: z := z - i;
005CE7B3 8B45F0           mov eax,[ebp-$10]
005CE7B6 2945F8           sub [ebp-$08],eax

normal_aux:

MyFormU.pas.55: begin
005CE824 55               push ebp
005CE825 8BEC             mov ebp,esp
005CE827 83C4EC           add esp,-$14
005CE82A 8955F8           mov [ebp-$08],edx
005CE82D 8945FC           mov [ebp-$04],eax
MyFormU.pas.57: for i := 0 to n - 1 do begin
005CE830 8B45FC           mov eax,[ebp-$04]
005CE833 48               dec eax
005CE834 85C0             test eax,eax
005CE836 7C29             jl $005ce861
005CE838 40               inc eax
005CE839 8945EC           mov [ebp-$14],eax
005CE83C C745F000000000   mov [ebp-$10],$00000000
MyFormU.pas.58: if (i > z) then
005CE843 8B45F0           mov eax,[ebp-$10]
005CE846 3B45F8           cmp eax,[ebp-$08]
005CE849 7E08             jle $005ce853
MyFormU.pas.59: z := z + i
005CE84B 8B45F0           mov eax,[ebp-$10]
005CE84E 0145F8           add [ebp-$08],eax
005CE851 EB06             jmp $005ce859
MyFormU.pas.61: z := z - i;
005CE853 8B45F0           mov eax,[ebp-$10]
005CE856 2945F8           sub [ebp-$08],eax

The only difference is one push and one pop. What happens if we turn on optimizations?

MyFormU.pas.47: z := subprogram_aux(n, z);
005CE7C5 8BD3             mov edx,ebx
005CE7C7 8BC6             mov eax,esi
005CE7C9 E8B6FFFFFF       call subprogram_aux

MyFormU.pas.76: z := normal_aux(n, z);
005CE82D 8BD3             mov edx,ebx
005CE82F 8BC6             mov eax,esi
005CE831 E8B6FFFFFF       call normal_aux

Both compile exactly to the same thing.

What happens when inlining?

MyFormU.pas.76: z := normal_aux(n, z);
005CE804 8BD3             mov edx,ebx
005CE806 8BC8             mov ecx,eax
005CE808 49               dec ecx
005CE809 85C9             test ecx,ecx
005CE80B 7C11             jl $005ce81e
005CE80D 41               inc ecx
005CE80E 33C0             xor eax,eax
005CE810 3BD0             cmp edx,eax
005CE812 7D04             jnl $005ce818
005CE814 03D0             add edx,eax
005CE816 EB02             jmp $005ce81a
005CE818 2BD0             sub edx,eax
005CE81A 40               inc eax
005CE81B 49               dec ecx
005CE81C 75F2             jnz $005ce810

subprogram_main:

MyFormU.pas.47: z := subprogram_aux(n, z);
005CE7A8 8BD3             mov edx,ebx
005CE7AA 8BC8             mov ecx,eax
005CE7AC 49               dec ecx
005CE7AD 85C9             test ecx,ecx
005CE7AF 7C11             jl $005ce7c2
005CE7B1 41               inc ecx
005CE7B2 33C0             xor eax,eax
005CE7B4 3BD0             cmp edx,eax
005CE7B6 7D04             jnl $005ce7bc
005CE7B8 03D0             add edx,eax
005CE7BA EB02             jmp $005ce7be
005CE7BC 2BD0             sub edx,eax
005CE7BE 40               inc eax
005CE7BF 49               dec ecx
005CE7C0 75F2             jnz $005ce7b4

Again, no difference.

I also profiled this little example, taking an average of 30 executions for each (normal and subprogram), called in random order:

constructor TForm1.Create(AOwner: TComponent);
const
  c_nSamples = 60;
  rnd_sample : array[0..c_nSamples - 1] of byte = (1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0);
var
  subprogram_gt_ns : Int64;
  normal_gt_ns     : Int64;
  rnd_input        : Integer;
  i                : Integer;
begin
  inherited Create(AOwner);

  normal_gt_ns     := 0;
  subprogram_gt_ns := 0;

  rnd_input        := Random(1000);

  for i := 0 to c_nSamples - 1 do
    if (rnd_sample[i] = 1) then
      Inc(subprogram_gt_ns, subprogram_main(rnd_input))
    else
      Inc(normal_gt_ns, normal_main(rnd_input));

  OutputDebugString(PChar(' Normal ' + FloatToStr(normal_gt_ns / 30) + ' Subprogram ' + FloatToStr(subprogram_gt_ns / 30)));
end;

There is no significant difference even with optimizations turned off:

Debug Output:  Normal 1166,66666666667 Subprogram 1203,33333333333 Process MyProject.exe (1824)

Finally, both texts that warn about performance mention something about shared local variables.

If we do not pass z to subprogram_aux, instead access it directly, we get:

MyFormU.pas.47: z := subprogram_aux(n);
005CE7D2 55               push ebp
005CE7D3 8BC3             mov eax,ebx
005CE7D5 E8AAFFFFFF       call subprogram_aux
005CE7DA 59               pop ecx
005CE7DB 8945FC           mov [ebp-$04],eax

Even with optimizations turned on.

like image 89
afarah Avatar answered Nov 08 '22 23:11

afarah