
Do most compilers optimize MATMUL(TRANSPOSE(A),B)?

In a Fortran program, I need to compute several expressions like M · v, MT · v, MT · M, M · MT, etc ... Here, M and v are 2D and 1D arrays of small size (less than 100, typically around 2-10)

I was wondering if writing MATMUL(TRANSPOSE(M),v) would unfold at compile time into some code as efficient as MATMUL(N,v), where N is explicitly stored as N=TRANSPOSE(M). I am specifically interested in the gnu and ifort compilers with "strong" optimization flags (-O2, -O3 or -Ofast for instance).
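For concreteness, the two variants under comparison can be sketched as follows (illustrative names only, not from any real code base):

```fortran
! A minimal sketch of the two formulations: letting the compiler see the
! transpose inside MATMUL vs. storing the transpose explicitly first.
program transpose_question
  implicit none
  integer, parameter :: n = 10
  double precision :: M(n,n), Mt(n,n), v(n), w1(n), w2(n)
  call random_number(M)
  call random_number(v)
  w1 = matmul(transpose(M), v)    ! transpose visible to the compiler
  Mt = transpose(M)               ! explicit temporary, as in N=TRANSPOSE(M)
  w2 = matmul(Mt, v)
  print *, maxval(abs(w1 - w2))   ! identical results; only the speed differs
end program transpose_question
```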

asked Mar 18 '19 by G. Fougeron

1 Answer

Below you will find execution times for a number of different methods.

system:

  • Intel(R) Core(TM) i5-6500T CPU @ 2.50GHz
  • cache size : 6144 KB
  • RAM : 16 GB
  • GNU Fortran (GCC) 6.3.1 20170216 (Red Hat 6.3.1-3)
  • ifort (IFORT) 18.0.5 20180823
  • BLAS : for the gnu compiler, the default (reference) BLAS was used

compilation:

[gnu] $ gfortran -O3 x.f90 -lblas
[intel] $ ifort -O3 -mkl x.f90

execution:

[gnu] $ ./a.out > matmul.gnu.txt
[intel] $ export MKL_NUM_THREADS=1; ./a.out > matmul.intel.txt

To make the results as comparable as possible, I rescaled the timings by the average time of an equivalent set of operations (see the figure captions). Threading was ignored.
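As I read the captions below, the normalization for the right-hand panels works roughly as follows (a sketch with placeholder numbers, not part of the benchmark program):

```fortran
! Sketch of the normalization: a raw timing at size n is divided by
! n**2 * delta (n**3 for the matrix-matrix case), where delta is the
! per-element time of the manual computation at n = 1000.
! The numeric values below are placeholders.
program normalization_sketch
  implicit none
  double precision :: t_manual_1000, delta, t_raw, t_norm
  integer :: n
  t_manual_1000 = 1.0d-3                          ! manual mat-vec time at n=1000
  delta = t_manual_1000 / (1000.0d0 * 1000.0d0)   ! average time per element
  n     = 50
  t_raw = 2.5d-6                                  ! raw timing measured at size n
  t_norm = t_raw / (dble(n)**2 * delta)           ! dimensionless timing
  print *, t_norm
end program normalization_sketch
```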

matrix times vector

Six different implementations were compared:

  1. manual: w=0.0d0; do j=1,n; do k=1,n; w(j) = w(j) + P(j,k)*v(k); end do; end do
  2. matmul: matmul(P,v)
  3. blas N: dgemv('N',n,n,1.0D0,P,n,v,1,0.0D0,w,1)
  4. matmul-transpose: matmul(transpose(P),v)
  5. matmul-transpose-tmp: Q=transpose(P); w=matmul(Q,v)
  6. blas T: dgemv('T',n,n,1.0D0,P,n,v,1,0.0D0,w,1)

In Figure 1 and Figure 2 you can compare the timing results for the above cases. Overall, using an explicit temporary is not advisable with either gfortran or ifort; both compilers optimize MATMUL(TRANSPOSE(P),v) much better on their own. While gfortran's MATMUL implementation is faster than the default BLAS, with ifort mkl-blas is clearly faster.

figure 1: Matrix-vector multiplication. Comparison of various implementations run with gfortran. The left panels show the absolute timing divided by the total time of the manual computation for a system of size 1000. The right panels show the absolute timing divided by n² × δ. Here δ is the average time of the manual computation of size 1000 divided by 1000 × 1000.

figure 2: Matrix-vector multiplication. Comparison of various implementations run with a single-threaded ifort compilation. The left panels show the absolute timing divided by the total time of the manual computation for a system of size 1000. The right panels show the absolute timing divided by n² × δ. Here δ is the average time of the manual computation of size 1000 divided by 1000 × 1000.

matrix times matrix

Six different implementations were compared:

  1. manual: Q=0.0d0; do l=1,n; do j=1,n; do k=1,n; Q(j,l) = Q(j,l) + P(j,k)*P(k,l); end do; end do; end do
  2. matmul: matmul(P,P)
  3. blas N: dgemm('N','N',n,n,n,1.0D0,P,n,P,n,0.0D0,R,n)
  4. matmul-transpose: matmul(transpose(P),P)
  5. matmul-transpose-tmp: Q=transpose(P); R=matmul(Q,P)
  6. blas T: dgemm('T','N',n,n,n,1.0D0,P,n,P,n,0.0D0,R,n)

In Figure 3 and Figure 4 you can compare the timing results for the above cases. In contrast to the vector case, using a temporary is only advisable with gfortran. While gfortran's MATMUL implementation is faster than the default BLAS, with ifort mkl-blas is clearly faster. Remarkably, the manual implementation is comparable to mkl-blas.

figure 3: Matrix-matrix multiplication. Comparison of various implementations run with gfortran. The left panels show the absolute timing divided by the total time of the manual computation for a system of size 1000. The right panels show the absolute timing divided by n³ × δ. Here δ is the average time of the manual computation of size 1000 divided by 1000 × 1000 × 1000.

figure 4: Matrix-matrix multiplication. Comparison of various implementations run with a single-threaded ifort compilation. The left panels show the absolute timing divided by the total time of the manual computation for a system of size 1000. The right panels show the absolute timing divided by n³ × δ. Here δ is the average time of the manual computation of size 1000 divided by 1000 × 1000 × 1000.


The code used:

program matmul_test

  implicit none

  double precision, dimension(:,:), allocatable :: P,Q,R
  double precision, dimension(:), allocatable :: v,w

  integer :: n,i,j,k,l
  double precision,dimension(12) :: t1,t2

  do n = 1,1000
     allocate(P(n,n),Q(n,n), R(n,n), v(n),w(n))
     call random_number(P)
     call random_number(v)

     i=0

     i=i+1
     call cpu_time(t1(i))
     w = 0.0d0
     do j=1,n; do k=1,n; w(j) = w(j) + P(j,k)*v(k); end do; end do
     call cpu_time(t2(i))

     i=i+1
     call cpu_time(t1(i))
     w=matmul(P,v)
     call cpu_time(t2(i))

     i=i+1
     call cpu_time(t1(i))
     call dgemv('N',n,n,1.0D0,P,n,v,1,0.0D0,w,1)
     call cpu_time(t2(i))

     i=i+1
     call cpu_time(t1(i))
     w=matmul(transpose(P),v)
     call cpu_time(t2(i))

     i=i+1
     call cpu_time(t1(i))
     Q=transpose(P)
     w=matmul(Q,v)
     call cpu_time(t2(i))

     i=i+1
     call cpu_time(t1(i))
     call dgemv('T',n,n,1.0D0,P,n,v,1,0.0D0,w,1)
     call cpu_time(t2(i))

     i=i+1
     call cpu_time(t1(i))
     Q = 0.0d0
     do l=1,n; do j=1,n; do k=1,n; Q(j,l) = Q(j,l) + P(j,k)*P(k,l); end do; end do; end do
     call cpu_time(t2(i))

     i=i+1
     call cpu_time(t1(i))
     Q=matmul(P,P)
     call cpu_time(t2(i))

     i=i+1
     call cpu_time(t1(i))
     call dgemm('N','N',n,n,n,1.0D0,P,n,P,n,0.0D0,R,n)
     call cpu_time(t2(i))

     i=i+1
     call cpu_time(t1(i))
     Q=matmul(transpose(P),P)
     call cpu_time(t2(i))

     i=i+1
     call cpu_time(t1(i))
     Q=transpose(P)
     R=matmul(Q,P)
     call cpu_time(t2(i))

     i=i+1
     call cpu_time(t1(i))
     call dgemm('T','N',n,n,n,1.0D0,P,n,P,n,0.0D0,R,n)
     call cpu_time(t2(i))

     write(*,'(I6,12D25.17)') n, t2-t1
     deallocate(P,Q,R,v,w)
  end do

end program matmul_test
answered Oct 29 '22 by kvantour