In a Fortran program, I need to compute several expressions like M·v, Mᵀ·v, Mᵀ·M, M·Mᵀ, etc. Here, M and v are 2D and 1D arrays of small size (less than 100, typically around 2-10).
I was wondering whether MATMUL(TRANSPOSE(M),v) unfolds at compile time into code as efficient as MATMUL(N,v), where N is explicitly stored as N=TRANSPOSE(M). I am specifically interested in the gnu and ifort compilers with "strong" optimization flags (-O2, -O3 or -Ofast, for instance).
Below are execution times for various implementations.
system:
- Intel(R) Core(TM) i5-6500T CPU @ 2.50GHz
- cache size : 6144 KB
- RAM : 16GB
- GNU Fortran (GCC) 6.3.1 20170216 (Red Hat 6.3.1-3)
- ifort (IFORT) 18.0.5 20180823
- BLAS : for the gnu compiler, the system's default BLAS; for ifort, MKL (see the compilation flags below)
compilation:
[gnu]   $ gfortran -O3 x.f90 -lblas
[intel] $ ifort -O3 -mkl x.f90
execution:
[gnu]   $ ./a.out > matmul.gnu.txt
[intel] $ export MKL_NUM_THREADS=1; ./a.out > matmul.intel.txt
To make the results as neutral as possible, I have rescaled the timings by the average time of an equivalent set of operations. Threading effects were ignored (single-threaded runs).
For the matrix-vector product, six different implementations were compared:
w=0.0D0; do j=1,n; do k=1,n; w(j) = w(j) + P(j,k)*v(k); end do; end do
matmul(P,v)
dgemv('N',n,n,1.0D0,P,n,v,1,0.0D0,w,1)
matmul(transpose(P),v)
Q=transpose(P); w=matmul(Q,v)
dgemv('T',n,n,1.0D0,P,n,v,1,0.0D0,w,1)
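A note on the BLAS calls: dgemv computes w = α·op(P)·v + β·w, and the scalars α and β must be double-precision literals (0.0D0, not the integer 0), otherwise an argument of the wrong kind is passed through the implicit interface. A minimal self-contained sketch, assuming a reference BLAS is linked with -lblas:

```fortran
program dgemv_demo
  implicit none
  integer, parameter :: n = 4
  double precision :: P(n,n), v(n), w(n)
  external :: dgemv          ! BLAS routine, no explicit interface

  call random_number(P)
  call random_number(v)

  ! w = 1.0 * P**T * v + 0.0 * w
  ! alpha and beta must be DOUBLE PRECISION literals (1.0D0, 0.0D0)
  call dgemv('T', n, n, 1.0D0, P, n, v, 1, 0.0D0, w, 1)

  print *, w
end program dgemv_demo
```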
In Figure 1 and Figure 2 you can compare the timing results for the above cases. Overall, using an explicit temporary is not advisable with either gfortran or ifort; both compilers optimize MATMUL(TRANSPOSE(P),v) much better. While gfortran's MATMUL implementation is faster than the default BLAS, with ifort the MKL BLAS is clearly faster.
figure 1: Matrix-vector multiplication. Comparison of various implementations run with gfortran. The left panels show the absolute timing divided by the total time of the manual computation for a system of size 1000. The right panels show the absolute timing divided by n² × δ, where δ is the average time of the manual computation at size 1000 divided by 1000 × 1000.
figure 2: Matrix-vector multiplication. Comparison of various implementations run with a single-threaded ifort build. The left panels show the absolute timing divided by the total time of the manual computation for a system of size 1000. The right panels show the absolute timing divided by n² × δ, where δ is the average time of the manual computation at size 1000 divided by 1000 × 1000.
For the matrix-matrix product, again six different implementations were compared:
Q=0.0D0; do l=1,n; do j=1,n; do k=1,n; Q(j,l) = Q(j,l) + P(j,k)*P(k,l); end do; end do; end do
matmul(P,P)
dgemm('N','N',n,n,n,1.0D0,P,n,P,n,0.0D0,R,n)
matmul(transpose(P),P)
Q=transpose(P); R=matmul(Q,P)
dgemm('T','N',n,n,n,1.0D0,P,n,P,n,0.0D0,R,n)
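Analogously, dgemm computes R = α·op(A)·op(B) + β·R, and again α and β must be double-precision literals. A minimal sketch, assuming a BLAS linked with -lblas:

```fortran
program dgemm_demo
  implicit none
  integer, parameter :: n = 4
  double precision :: P(n,n), R(n,n)
  external :: dgemm          ! BLAS routine, no explicit interface

  call random_number(P)

  ! R = P**T * P ; note 0.0D0 for beta, not the integer 0
  call dgemm('T', 'N', n, n, n, 1.0D0, P, n, P, n, 0.0D0, R, n)

  print *, R(1,1)
end program dgemm_demo
```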
In Figure 3 and Figure 4 you can compare the timing results for the above cases. In contrast to the matrix-vector case, using a temporary is advisable only with gfortran. While gfortran's MATMUL implementation is faster than the default BLAS, with ifort the MKL BLAS is clearly faster. Remarkably, the manual implementation is comparable to the MKL BLAS.
figure 3: Matrix-matrix multiplication. Comparison of various implementations run with gfortran. The left panels show the absolute timing divided by the total time of the manual computation for a system of size 1000. The right panels show the absolute timing divided by n³ × δ, where δ is the average time of the manual computation at size 1000 divided by 1000 × 1000 × 1000.
figure 4: Matrix-matrix multiplication. Comparison of various implementations run with a single-threaded ifort build. The left panels show the absolute timing divided by the total time of the manual computation for a system of size 1000. The right panels show the absolute timing divided by n³ × δ, where δ is the average time of the manual computation at size 1000 divided by 1000 × 1000 × 1000.
The code used:
program matmul_test
implicit none
double precision, dimension(:,:), allocatable :: P,Q,R
double precision, dimension(:), allocatable :: v,w
integer :: n,i,j,k,l
double precision,dimension(12) :: t1,t2
do n = 1,1000
allocate(P(n,n),Q(n,n), R(n,n), v(n),w(n))
call random_number(P)
call random_number(v)
i=0
i=i+1
call cpu_time(t1(i))
! manual matrix-vector product: accumulate into w
w=0.0D0
do j=1,n; do k=1,n; w(j) = w(j) + P(j,k)*v(k); end do; end do
call cpu_time(t2(i))
i=i+1
call cpu_time(t1(i))
w=matmul(P,v)
call cpu_time(t2(i))
i=i+1
call cpu_time(t1(i))
call dgemv('N',n,n,1.0D0,P,n,v,1,0.0D0,w,1)
call cpu_time(t2(i))
i=i+1
call cpu_time(t1(i))
w=matmul(transpose(P),v)
call cpu_time(t2(i))
i=i+1
call cpu_time(t1(i))
Q=transpose(P)
w=matmul(Q,v)
call cpu_time(t2(i))
i=i+1
call cpu_time(t1(i))
call dgemv('T',n,n,1.0D0,P,n,v,1,0.0D0,w,1)
call cpu_time(t2(i))
i=i+1
call cpu_time(t1(i))
! manual matrix-matrix product: accumulate into Q
Q=0.0D0
do l=1,n; do j=1,n; do k=1,n; Q(j,l) = Q(j,l) + P(j,k)*P(k,l); end do; end do; end do
call cpu_time(t2(i))
i=i+1
call cpu_time(t1(i))
Q=matmul(P,P)
call cpu_time(t2(i))
i=i+1
call cpu_time(t1(i))
call dgemm('N','N',n,n,n,1.0D0,P,n,P,n,0.0D0,R,n)
call cpu_time(t2(i))
i=i+1
call cpu_time(t1(i))
Q=matmul(transpose(P),P)
call cpu_time(t2(i))
i=i+1
call cpu_time(t1(i))
Q=transpose(P)
R=matmul(Q,P)
call cpu_time(t2(i))
i=i+1
call cpu_time(t1(i))
call dgemm('T','N',n,n,n,1.0D0,P,n,P,n,0.0D0,R,n)
call cpu_time(t2(i))
write(*,'(I6,12D25.17)') n, t2-t1
deallocate(P,Q,R,v,w)
end do
end program matmul_test
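One caveat for the small sizes the question is actually about (n around 2-10): a single operation at such sizes completes far below the resolution of cpu_time, so the per-call timings in the benchmark above are unreliable at the low end. A common workaround is to repeat each operation many times inside the timed region and divide by the repetition count. A sketch of such a loop, assuming the same MATMUL(TRANSPOSE(P),v) variant; rep and the checksum s are illustrative choices, not part of the original benchmark:

```fortran
program small_n_timing
  implicit none
  integer, parameter :: n = 8, rep = 1000000
  double precision :: P(n,n), v(n), w(n)
  double precision :: t1, t2, s
  integer :: r

  call random_number(P)
  call random_number(v)

  s = 0.0D0
  call cpu_time(t1)
  do r = 1, rep
    w = matmul(transpose(P), v)
    s = s + w(1)   ! consume the result so the loop is not optimized away
  end do
  call cpu_time(t2)

  print *, 'seconds per call: ', (t2 - t1) / rep
  print *, 'checksum: ', s
end program small_n_timing
```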