Performance drop of DO loops by potential change of bound variables

In the following code, we pass two arrays to a subroutine and perform some additional operations inside DO loops. We consider three cases, each with a different extra operation: Case 1 = no extra operation; Cases 2 and 3 = an assignment to a pointer variable (i.e., to its target).

!------------------------------------------------------------------------
module mymod
    implicit none
    integer, pointer :: n_mod
    integer :: nloop
contains

!.........................................................
subroutine test_2dim ( a, b, n )
    integer :: n
    real :: a(n,n), b(n,n)

    integer, pointer :: n_ptr
    integer i1, i2, iloop

    n_ptr => n_mod

    do iloop = 1, nloop

        do i2 = 1, n
        do i1 = 1, n
            b(i1,i2) = a(i1,i2) + b(i1,i2) + iloop

             !(nothing here)    !! Case 1 : gfort => 3.6 sec, ifort => 3.2 sec
!            n_ptr = n          !! Case 2 : gfort => 15.9 sec, ifort => 6.2 sec
!            n_ptr = n_mod      !! Case 3 : gfort => 3.6 sec, ifort => 3.5 sec
        enddo
        enddo

    enddo
endsubroutine

endmodule

!------------------------------------------------------------------------
program main
    use mymod
    implicit none
    integer, target :: n
    real, allocatable :: a(:,:), b(:,:)

    nloop = 10000 ; n = 1000
    allocate( a( n, n ), b( n, n ) )
    a = 0.0 ; b = 0.0

    n_mod => n

    call test_2dim ( a, b, n )

    print *, a(n,n), b(n,n)   !! for check
end
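
(For reference, the timings quoted in the comments were obtained with -O3; the command lines were along these lines, where the source file name test.f90 is just a placeholder:)

gfortran -O3 test.f90 && ./a.out
ifort    -O3 test.f90 && ./a.out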

Note that this pointer is associated with the upper bound of the DO loops via the module variable n_mod, so an assignment through the pointer inside the loop could in principle affect the behavior of the loops. In practice, however, we never change the bound (the assignment merely copies the value of n onto itself). gfortran 4.8 and ifort 14.0 with -O3 gave the timings listed in the comments above. Notably, Case 2 is very slow compared to Case 1, even though the net amount of computation is almost the same. I suspected that this is because the compiler cannot determine whether the upper bound of the inner loop (over i1) is changed by the assignment through the pointer, and so it avoids aggressive optimization. To verify this, I tested the following routine in place of test_2dim():

subroutine test_1dim ( a, b, n )
    integer :: n
    real :: a(n * n), b(n * n)

    integer, pointer :: n_ptr
    integer iloop, i

    n_ptr => n_mod

    do iloop = 1, nloop

        do i = 1, n * n
             b( i ) = a( i ) + b( i ) + iloop

             ! (nothing here)    !! Case 1 : gfort => 3.6 sec, ifort => 2.3 sec
             ! n_ptr = n         !! Case 2 : gfort => 15.9 sec, ifort => 6.0 sec
             ! n_ptr = n_mod     !! Case 3 : gfort => 3.6 sec, ifort => 6.1 sec
        enddo

    enddo
endsubroutine

Here the sole difference between test_1dim() and test_2dim() is that the arrays a and b are accessed through 1-dimensional rather than 2-dimensional indices (so the amount of computation is essentially the same). Surprisingly, Case 2 is again slow, even though there is only a single DO loop over the array. Because a Fortran DO loop fixes its iteration count (and hence its effective upper bound) on entry [Ref], I expected test_1dim() not to be affected by the assignment through the pointer, but this was not the case. So, is there any reasonable explanation for this behavior? (I hope I am not making some big mistake that leads to this time difference.)
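
(As an aside, here is a minimal stand-alone illustration of that rule about the iteration count; this is my own example and not part of the timing tests:)

program trip_count
    implicit none
    integer :: i, m
    m = 5
    do i = 1, m
        m = 1          !! modifying the bound variable inside the loop is allowed...
        print *, i     !! ...but the loop still runs 5 times (i = 1 to 5)
    enddo
end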


My motivation for this question: I have been using derived types extensively to specify multi-dimensional loops, for example

module Grid_mod
type Grid_t
    integer :: N1, N2, N3
endtype
....

subroutine some_calc ( vector, grid )
type(Grid_t) :: grid
....
do i3 = 1, grid % N3
do i2 = 1, grid % N2
do i1 = 1, grid % N1
    (... various operations...)
enddo
enddo
enddo

Up to now, I have not paid much attention to whether Grid_t objects are given the TARGET or POINTER attribute, assuming that it would have essentially no effect on performance. However, I now think this might cause a performance drop if the compiler cannot determine whether the upper bounds stay constant inside the loops (even though I never change the bounds in my actual codes). So I would appreciate any advice on whether I should be more careful about giving the TARGET or POINTER attribute to bound variables (including derived-type components, such as those of the grid object above).
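
(If such an effect exists, the mitigation I have in mind is to copy the bound components to plain local integers before the loops. The following is only a hypothetical sketch; the local names n1_, n2_, n3_ are made up:)

subroutine some_calc ( vector, grid )
    type(Grid_t) :: grid
    real :: vector(:)
    integer :: i1, i2, i3
    integer :: n1_, n2_, n3_        !! plain local copies of the bounds

    n1_ = grid % N1 ; n2_ = grid % N2 ; n3_ = grid % N3

    do i3 = 1, n3_
    do i2 = 1, n2_
    do i1 = 1, n1_
        !! (... various operations ...)
    enddo
    enddo
    enddo
endsubroutine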


Update

Following the suggestion by @francescalus, I have tried attaching "intent(in), value" to the dummy argument "n". The results are as follows:

test_1dim():
    Case 1: gfort => 3.6 s, ifort => 2.3 s
    Case 2: gfort => 3.6 s, ifort => 3.1 s
    Case 3: gfort => 3.6 s, ifort => 3.4 s 

test_2dim():
    Case 1: gfort => 3.7 s, ifort => 3.1 s
    Case 2: gfort => 3.7 s, ifort => 3.1 s
    Case 3: gfort => 3.7 s, ifort => 6.4 s
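
(For clarity, the only change here is in the declaration of the dummy argument; e.g., for test_2dim():)

subroutine test_2dim ( a, b, n )
    integer, intent(in), value :: n   !! n is now a local copy, so it cannot be aliased by n_mod / n_ptr
    real :: a(n,n), b(n,n)
    ...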

While ifort gives a somewhat irregular result (6.4 s) for Case 3 in test_2dim(), all the other results now show essentially the best performance. This strongly suggests that it is the compiler's treatment of the bounds that affects the performance, rather than the cost of the pointer assignment itself. Because it seems important to tell the compiler that the bounds are constant, I also tried copying the dummy argument n (this time without "intent(in), value") to a local variable n_ and using it as the loop bounds:

  integer :: n    !! dummy argument
  integer :: n_   !! a local variable
  ...
  n_ = n

  do i2 = 1, n_
  do i1 = 1, n_
      b(i1,i2) = a(i1,i2) + b(i1,i2) + iloop
      ...

The result for test_2dim() is as follows:

test_2dim():
    Case 1: gfort => 3.6 s, ifort => 3.1 s
    Case 2: gfort => 15.9 s, ifort => 6.2 s
    Case 3: gfort => 3.7 s, ifort => 6.4 s

Here, unfortunately (and contrary to my expectation), Case 2 did not improve at all... Although copying to a local n_ should guarantee that n_ is constant inside the DO loops, the compiler still seems unhappy, presumably because the array shape is still determined by n, not by n_, so it still avoids aggressive optimization (just my guess).


Update2

Following the suggestion by @innoSPG, I also changed n to n_ on the right-hand side of the assignment inside the DO loops for Case 2, and it turns out that the code then runs as fast as Case 1! Specifically, the code is

  n_ = n

  do i2 = 1, n_
  do i1 = 1, n_
      b(i1,i2) = a(i1,i2) + b(i1,i2) + iloop
      
      n_ptr = n_          !! Case 2 : gfort => 3.7 sec, ifort => 3.1 sec

But as the Answer suggests, this efficiency may simply be because the assignment statement is eliminated entirely by the compiler. So I think I need to test more practical codes (not overly simple ones) to assess the impact of pointers or pointer components on loop optimization...
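
(One small guard I might add in such tests, just as a sketch, is to make the pointer target visible in the final output so that the assignment is less likely to be removed as dead code; whether this is enough for a provable self-assignment like the one here is another question. E.g., in the main program:)

call test_2dim ( a, b, n )
print *, a(n,n), b(n,n), n_mod   !! also print the pointer target for check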

(...I am sorry for a very long question...)



1 Answer

When optimization is enabled, the compiler spends more time analyzing your program so that it can avoid unnecessary computation, in addition to taking advantage of the architecture. The outcome is strongly dependent on both the compiler and the architecture.

So my guess is that the compiler knows beforehand that n_ptr and n_mod are exactly the same thing and does not even spend time on the assignment in Case 3. The scenario is similar for Case 2 when n is intent(in): the compiler can tell that the assignment does not have to be done inside the loop and only needs to be done once, because n_ptr is not involved in any other computation in the subroutine. I suspect ifort misses that point. Also, you might be running on an Intel-based architecture, which gives some advantage to ifort.
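
Conceptually, the transformation the compiler can apply in that situation looks roughly like this (an illustration of loop-invariant code motion, not actual compiler output):

do iloop = 1, nloop
    do i2 = 1, n
    do i1 = 1, n
        b(i1,i2) = a(i1,i2) + b(i1,i2) + iloop
    enddo
    enddo
enddo
n_ptr = n    !! the invariant assignment is performed only once, outside the loops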

For Case 2 when n is not intent(in), the compiler has to assume that the corresponding actual argument is a target that can be changed by other means, and it has no clue which pointers may be pointing to it. On top of that comes the dereferencing of the pointer during the assignment, which adds to the computing time. Roughly speaking, loading or storing a value through a pointer variable takes about twice as long as loading or storing a non-pointer variable. I do not know how pointers are actually implemented in Fortran, so I cannot give serious estimates of the time factors; it depends strongly on the implementation of pointers and targets.

I have not tried this yet, but for your test with the local variable n_, I suggest that you also change the right-hand side of the Case 2 assignment to the local variable n_. I strongly believe you will then get the same time as in Case 1, because the compiler can see that the assignment does not have to be done inside the loop:

n_ = n
do iloop = 1, nloop

    do i2 = 1, n_
    do i1 = 1, n_
        b(i1,i2) = a(i1,i2) + b(i1,i2) + iloop

         !(nothing here)    !! Case (1)
         !n_ptr = n_          !! Case (2)
         !n_ptr = n_mod      !! Case (3)
    enddo
    enddo

enddo