Is row-major ordering more efficient for matrix-vector multiplication?

If M is an n x m matrix and u and v are vectors, then in terms of indices, matrix-vector multiplication looks like u[i] = sum(M[i,j] * v[j], 1 <= j <= m). Since v is a vector, its elements are presumably stored in consecutive memory locations in languages oriented toward numerical computation. If M is stored in row-major order (as in C, Mathematica, and Pascal), then the successive M[i,j] in the sum are also stored in consecutive memory locations as j is incremented, making the iteration very efficient. If M is stored in column-major order (as in Fortran, Matlab, R, and Julia), then incrementing j requires jumping ahead by a number of memory locations equal to the column stride, which in this case equals n. This naively seems less efficient for matrices with many rows. (For matrix-matrix multiplication the problem doesn't arise, because under either ordering convention, incrementing the summed index requires moving by the major stride in one matrix's memory or the other.)

Is the difference between moving over in memory by one unit and by many units appreciable or negligible on most computer architectures, compared to the cost of the multiplication and addition operations? (I'm guessing "negligible", since in practice Fortran is typically at least as fast as C, but can anyone elaborate on why?)

1 Answer

In principle, the difference is expected to be large on most computer architectures.

Matrix-vector multiplication is a memory-bound computation because the reuse of loaded data is low. All N components of v are reused to compute each element of u, but each of the N^2 elements of the matrix is used only once. If we compare the latency of a typical memory access (see e.g. https://gist.github.com/hellerbarde/2843375), on the order of 100 ns or less, with the time required to perform a floating-point operation, less than 1 ns, we see that the majority of the time is spent loading and storing values from/to the arrays.
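To put rough numbers on this (a back-of-the-envelope estimate using the latency figures above, ignoring caches and pipelining, so an upper bound rather than a measurement): for an N x N matrix,

    T_mem  ~ N^2 * 100 ns    (each matrix element loaded from main memory once)
    T_flop ~ 2 * N^2 * 1 ns  (one multiply and one add per element)

so uncached loads would cost roughly 50 times as much as the arithmetic. The cache hierarchy and the prefetcher have to hide most of that latency, and they do so far more effectively when the accesses are contiguous.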

We can still implement it in a cache-friendly way, i.e. with as much data locality as possible. Since memory is loaded into the cache in whole lines, we have to use each loaded cache line as much as possible. That is why accessing contiguous memory regions reduces the time spent loading data from memory.

To support this, let us try a very simple program:

program mv
   implicit none
   integer, parameter :: n = 10000
   real, allocatable :: M(:,:), v(:), u(:)
   real :: start, finish
   integer :: i, j

   allocate(M(n,n), v(n), u(n))
   call random_number(M)
   call random_number(v)
   u(:) = 0.

   call cpu_time(start)
   do i=1,n
      do j=1,n
         ! non-contiguous order: successive M(i,j) are n reals apart
         u(i) = u(i) + M(i,j)*v(j)
         ! contiguous order: successive M(j,i) are adjacent in memory
         ! (note this computes u = transpose(M)*v; see the discussion below)
         ! u(i) = u(i) + M(j,i)*v(j)
      enddo
   enddo
   call cpu_time(finish)

   print *, 'elapsed time: ', finish-start
end program mv
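To reproduce the timings below, one possible invocation (the file name mv.f90 and the exact flags are just examples) is gfortran -O3 mv.f90 -o mv followed by ./mv, toggling which of the two lines in the inner loop is commented out between runs.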

Some results (times in seconds):

                non-contiguous order   contiguous order
gfortran -O0            1.                 0.5
gfortran -O3           0.3                 0.1
ifort -O0              1.5                0.85
ifort -O3            0.037               0.035

As you can see, the difference is significant when compiling without optimizations. With optimization enabled, gfortran still shows a significant difference, whereas with ifort there is only a small one. Looking at the compiler report, it seems that ifort interchanged the loops, thus producing contiguous accesses in the inner loop.

However, can we say that a language with row-major ordering is more efficient for matrix-vector multiplication? No, and not only because the compiler can compensate for the difference. The code itself does not know anything about rows and columns of M: it basically knows that M has two indices, one of which, depending on the language, is contiguous in memory. For matrix-vector multiplication, the best data locality is obtained when the "fast" (contiguous) index runs along a row of the algebraic matrix, i.e. over the summed index j. You can achieve this with both "row-major" and "column-major" languages; you just have to store the values of M accordingly. As an example, if you have the "algebraic" matrix

     [ M11 M12 ]
M =  [         ]
     [ M21 M22 ]

you store it as the "computational" matrix

C       ==> M[1,1] = M11 ; M[1,2] = M12 ; M[2,1] = M21 ; M[2,2] = M22 
Fortran ==> M[1,1] = M11 ; M[2,1] = M12 ; M[1,2] = M21 ; M[2,2] = M22 

so that you are always contiguous along a row of the "algebraic" matrix. The computer knows nothing about the original matrix, but we know that the computational matrix is the transposed version of the algebraic one. In both cases, the inner loop iterates over a contiguous index, and the final result is the same vector.
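A minimal runnable sketch of this storage trick in Fortran (the array names A and Mt are illustrative, not taken from the code above):

program mv_transposed
   implicit none
   integer, parameter :: n = 4
   real :: A(n,n), Mt(n,n), v(n), u(n)
   integer :: i, j
   call random_number(A)   ! A holds the "algebraic" matrix
   call random_number(v)
   Mt = transpose(A)       ! computational matrix: Mt(j,i) = A(i,j)
   u(:) = 0.
   do i=1,n
      do j=1,n
         ! j is now the FIRST index of Mt, so the inner loop walks
         ! down a column of Mt, which is contiguous in column-major
         u(i) = u(i) + Mt(j,i)*v(j)
      enddo
   enddo
   print *, maxval(abs(u - matmul(A, v)))   ! ~0 up to rounding
end program mv_transposed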

In a complex code, if I have already allocated and filled the matrix with values and I cannot decide to store the transposed matrix, it is possible that a "row-major" language gives the best performance. But interchanging the loops (see https://en.wikipedia.org/wiki/Loop_interchange), as done automatically by the Intel compiler and in BLAS implementations (see http://www.netlib.org/lapack/explore-html/db/d58/sgemv_8f_source.html), reduces the difference to a very small one. Therefore, using Fortran you should prefer:

do j=1,n
    do i=1,n
        ! inner loop over i: M(i,j) is accessed with stride 1 (column-major)
        u(i)=u(i)+M(i,j)*v(j)
    enddo
enddo
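A brief note on why this interchanged version works well: in the inner loop, i varies fastest, so M(i,j) is accessed with stride 1 in column-major storage, and v(j) is constant across the inner loop, so the compiler can keep it in a register. This is essentially the access pattern used in the reference BLAS sgemv linked above.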