Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to Avoid Conditionals in Loops

In this link, the author gives an example as

subroutine threshold(a, thresh, ic)
  real, dimension(:), intent(in) :: a
  real, intent(in) :: thresh
  integer, intent(out) :: ic
  real :: tt
  integer :: n

  ic = 0
  tt = 0.d0
  n = size(a)
  do j = 1, n
     tt = tt + a(j) * a(j)
     if (sqrt(tt) >= thresh) then
        ic = j
        return
     end if
  end do
end subroutine threshold

and the author commented this code as

An alternative approach, which would allow for many optimizations (loop unrolling, CPU pipelining, less time spent evaluating the conditional) would involve adding tt in blocks (e.g., blocks of size 128) and checking the conditional after each block. When it the condition is met, the last block can be repeated to determine the value of ic.

What does it mean? loop unrolling? CPU pipelining? adding tt in blocks?

How to optimize the code as the author say?

like image 812
user15964 Avatar asked Dec 27 '15 03:12

user15964


People also ask

Can we run for loop without condition?

Yes it is perfectly correct to do so.

Is condition necessary in while loop?

While Loop is a type of loop that is used when you don't know exactly how many times the code will repeat. It's based on a condition, so the instruction inside the while should be either a boolean value (True/False) or an operator that returns a boolean (<,>,==, etc.).

DO FOR loops check the condition first?

do while loop starts with the execution of the statement(s). There is no checking of any condition for the first time. After the execution of the statements, and update of the variable value, the condition is checked for true or false value. If it is evaluated to true, next iteration of loop starts.

Can we write if condition in for loop?

You can nest If statements inside For Loops. For example you can loop through a list to check if the elements meet certain conditions. You can also have a For Loop inside another For loop.


1 Answers

If the loop is performed in chunks/blocks that fit into the CPU cache you will reduce the number of cache misses, and consequently the number of cache lines retrieved from memory. This increases the performance on all loops that are limited by memory operations. If the corresponding block size is BLOCKSIZE, this is achieved by

do j = 1, n, BLOCKSIZE
   do jj = j, j+BLOCKSIZE-1 
      tt = tt + a(jj) * a(jj)
   end do
end do

This, however, will leave a remainder that is not treated in the main loop. To illustrate this, consider an array of length 1000. The first seven chunks (1--896) are covered in the loop, but the eighth one (897--1024) is not. Therefore, another loop for the remainder is required:

do j=(n/BLOCKSIZE)*BLOCKSIZE,n
   ! ...
enddo

While it makes little sense to remove the conditional from the remainder loop, it can be performed in the outer loop of the blocked main loop. As now no branches occur in the inner loop, aggressive optimizations might be applicable then.
However, this limits the "accuracy" of the determined position to the blocks. To get to an element-wise accuracy, you have to repeat the calculation.

Here is the complete code:

subroutine threshold_block(a, thresh, ic)
  implicit none
  real, dimension(:), intent(in) :: a
  real, intent(in) :: thresh
  integer, intent(out) :: ic
  real :: tt, tt_bak, thresh_sqr
  integer :: n, j, jj
  integer,parameter :: BLOCKSIZE = 128

  ic = 0
  tt = 0.d0
  thresh_sqr = thresh**2
  n = size(a)
  ! Perform the loop in chunks of BLOCKSIZE
  do j = 1, n, BLOCKSIZE
     tt_bak = tt
     do jj = j, j+BLOCKSIZE-1 
        tt = tt + a(jj) * a(jj)
     end do
     ! Perform the check on the block level
     if (tt >= thresh_sqr) then
        ! If the threshold is reached, repeat the last block 
        ! to determine the last position
        tt = tt_bak
        do jj = j, j+BLOCKSIZE-1 
           tt = tt + a(jj) * a(jj)
           if (tt >= thresh_sqr) then
              ic = jj
              return
           end if
        end do
     end if
  end do

  ! Remainder is treated element-wise
  do j=(n/BLOCKSIZE)*BLOCKSIZE,n
     tt = tt + a(j) * a(j)
     if (tt >= thresh_sqr) then
        ic = j
        return
     end if
  end do 
end subroutine threshold_block

Please note that the compilers are nowadays very good in creating blocked loops in combination with other optimizations. In my experience it is quite difficult to get a better performance out of such simple loops by manually tweaking it.
Loop blocking is enabled in gfortran with the compiler option -floop-block.


Loop unrolling can be done manually, but should be left to the compiler. The idea is to manually perform a loop in blocks and instead of a second loop as shown above, perform the operations by duplicating the code. Here is an example for the inner loop as given above, for a loop unrolling of factor four:

do jj = j, j+BLOCKSIZE-1,4
   tt = tt + a(jj)   * a(jj)
   tt = tt + a(jj+1) * a(jj+1)
   tt = tt + a(jj+2) * a(jj+2)
   tt = tt + a(jj+3) * a(jj+3)
end do

Here, no remainder can occur if BLOCKSIZE is a multiple of 4. You can probably shave off a few operations in here ;-) The gfortran compiler option to enable this is -funroll-loops


As far as I know, CPU Pipelining (Instruction Pipelining) cannot be enforced manually in Fortran. This task is up to the compiler.

Pipelining sets up a pipe of instructions. You feed the complete array into that pipe and, after the wind-up phase, you will get a result with each clock cycle. This drastically increases the throughput. However, branches are difficult (impossible?) to treat in pipes, and the array should be long enough that the time required for setting up the pipe, wind-up, and wind-down phase are compensated.

like image 64
Alexander Vogt Avatar answered Sep 27 '22 20:09

Alexander Vogt