How to Avoid Conditionals in Loops

Tags:

In this link, the author gives an example as

subroutine threshold(a, thresh, ic)
  real, dimension(:), intent(in) :: a
  real, intent(in) :: thresh
  integer, intent(out) :: ic
  real :: tt
  integer :: n

  ic = 0
  tt = 0.d0
  n = size(a)
  do j = 1, n
     tt = tt + a(j) * a(j)
     if (sqrt(tt) >= thresh) then
        ic = j
        return
     end if
  end do
end subroutine threshold

and the author commented this code as

An alternative approach, which would allow for many optimizations (loop unrolling, CPU pipelining, less time spent evaluating the conditional) would involve adding tt in blocks (e.g., blocks of size 128) and checking the conditional after each block. When it the condition is met, the last block can be repeated to determine the value of ic.

What does it mean? loop unrolling? CPU pipelining? adding tt in blocks?

How to optimize the code as the author say?

812

asked Dec 27 '15 03:12

user15964

1 Answers

If the loop is performed in chunks/blocks that fit into the CPU cache you will reduce the number of cache misses, and consequently the number of cache lines retrieved from memory. This increases the performance on all loops that are limited by memory operations. If the corresponding block size is BLOCKSIZE, this is achieved by

do j = 1, n, BLOCKSIZE
   do jj = j, j+BLOCKSIZE-1 
      tt = tt + a(jj) * a(jj)
   end do
end do

This, however, will leave a remainder that is not treated in the main loop. To illustrate this, consider an array of length 1000. The first seven chunks (1--896) are covered in the loop, but the eighth one (897--1024) is not. Therefore, another loop for the remainder is required:

do j=(n/BLOCKSIZE)*BLOCKSIZE,n
   ! ...
enddo

While it makes little sense to remove the conditional from the remainder loop, it can be performed in the outer loop of the blocked main loop. As now no branches occur in the inner loop, aggressive optimizations might be applicable then.
However, this limits the "accuracy" of the determined position to the blocks. To get to an element-wise accuracy, you have to repeat the calculation.

Here is the complete code:

subroutine threshold_block(a, thresh, ic)
  implicit none
  real, dimension(:), intent(in) :: a
  real, intent(in) :: thresh
  integer, intent(out) :: ic
  real :: tt, tt_bak, thresh_sqr
  integer :: n, j, jj
  integer,parameter :: BLOCKSIZE = 128

  ic = 0
  tt = 0.d0
  thresh_sqr = thresh**2
  n = size(a)
  ! Perform the loop in chunks of BLOCKSIZE
  do j = 1, n, BLOCKSIZE
     tt_bak = tt
     do jj = j, j+BLOCKSIZE-1 
        tt = tt + a(jj) * a(jj)
     end do
     ! Perform the check on the block level
     if (tt >= thresh_sqr) then
        ! If the threshold is reached, repeat the last block 
        ! to determine the last position
        tt = tt_bak
        do jj = j, j+BLOCKSIZE-1 
           tt = tt + a(jj) * a(jj)
           if (tt >= thresh_sqr) then
              ic = jj
              return
           end if
        end do
     end if
  end do

  ! Remainder is treated element-wise
  do j=(n/BLOCKSIZE)*BLOCKSIZE,n
     tt = tt + a(j) * a(j)
     if (tt >= thresh_sqr) then
        ic = j
        return
     end if
  end do 
end subroutine threshold_block

Please note that the compilers are nowadays very good in creating blocked loops in combination with other optimizations. In my experience it is quite difficult to get a better performance out of such simple loops by manually tweaking it.
Loop blocking is enabled in gfortran with the compiler option -floop-block.

Loop unrolling can be done manually, but should be left to the compiler. The idea is to manually perform a loop in blocks and instead of a second loop as shown above, perform the operations by duplicating the code. Here is an example for the inner loop as given above, for a loop unrolling of factor four:

do jj = j, j+BLOCKSIZE-1,4
   tt = tt + a(jj)   * a(jj)
   tt = tt + a(jj+1) * a(jj+1)
   tt = tt + a(jj+2) * a(jj+2)
   tt = tt + a(jj+3) * a(jj+3)
end do

Here, no remainder can occur if BLOCKSIZE is a multiple of 4. You can probably shave off a few operations in here ;-) The gfortran compiler option to enable this is -funroll-loops

As far as I know, CPU Pipelining (Instruction Pipelining) cannot be enforced manually in Fortran. This task is up to the compiler.

Pipelining sets up a pipe of instructions. You feed the complete array into that pipe and, after the wind-up phase, you will get a result with each clock cycle. This drastically increases the throughput. However, branches are difficult (impossible?) to treat in pipes, and the array should be long enough that the time required for setting up the pipe, wind-up, and wind-down phase are compensated.

answered Sep 27 '22 20:09

Alexander Vogt

Related questions
                            
                                Does PHP handle numerically-indexed arrays differently (internally)?
                            
                                Which one is better in calling a function: two times or storing the result in a variable?
                            
                                LibGDX HTML runs slowly
                            
                                How to free memory after array in PHP? unset and = null seems to not working
                            
                                Thousands of ORMLite raw inserts taking several minutes on Android
                            
                                Python using lambda to apply pd.DataFrame instead for nested loop is it possible?
                            
                                fastest way to get a hash from a list in python
                            
                                Couchbase Query Execution time?
                            
                                Multi Threading Java
                            
                                Jmeter JSON Path Assertion
                            
                                How to efficiently sum two columns in a file with 270,000+ rows in bash
                            
                                Ruby on Rails - Bullet/N+1
                            
                                Why does Java CPU profile (using visualvm) show so many hits on a method that does nothing?
                            
                                What is the difference between split method in String class and the split method in Apache StringUtils?
                            
                                Javascript Module Pattern create multiple instances using Prototype
                            
                                jmeter latency vs actual browser load test
                            
                                Is there a difference in runtime efficiency if I evaluate the size of the array outside the loop?
                            
                                Java Math.pow(x,2.0) vs Math.pow(x,2.0000001) performance
                            
                                Efficiently calculating weighted distance in MATLAB
                            
                                Slow execution under 64 bits. Possible RyuJIT bug?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How to Avoid Conditionals in Loops

Tags:

performance

optimization

fortran

fortran90

user15964

People also ask

1 Answers

Alexander Vogt

Recent Activity

Donate For Us