Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Row major vs Column Major Matrix Multiplication

I am currently working on a C program trying to compute Matrix Multiplication.. I have approached this task by looping through each column of the second matrix as seen below.

I have set size to 1000.

for(i=0;i<size;i++)
{
  for(j=0;j<size;j++)
  {
    for(k=0;k<size;k++)
    {
      matC[i][j]+=matA[i][k]*matB[k][j];
    }
  }
}

I wanted to know what problematic access pattern is in this implementation.. What makes row/column access more efficient than the other? I am trying to understand this in terms of logic from the use of Caches.. Please help me understand this. Your help is much appreciated :)

like image 307
WoLv3rine Avatar asked Jan 15 '23 08:01

WoLv3rine


1 Answers

If you are talking about use of Caches then you might want to do something called loop tiling. You break the loop into tiles such that inner part of the loop gets stored inside cache (which is quite large these days). So your loop will turn into something like (if you are passing the matrices into a function using pointers )

for(j=0;j<size;j+=t)
    for(k=0;k<size;k+=t)
       for(i=0;i<size;i+=t)
          for(ii=i;ii<MIN(i+t,size);ii++)
             for(jj=j;jj<MIN(j+t,size);jj++)
        {       
          var=*(c+ii * size+jj);    //Var is a scalar variable                     
          for(kk=k;kk<MIN(k+t,size);kk++)
              {                      
         var = var + *(a+ii *size +kk) * *(bt +jj * size+ kk);          
              }
          *(c+ii *size +jj) = var;
        }                       

The value of t varies depending on the speedup that you get. It can t = 64,128,256 and so on. There are many other techniques that you can use here. Loop tiling is just once technique to utilize the cache efficiently.Further, you can transpose the B matrix before you send to the multiplication function. That way you will get a linear access of elements of matrix B. To explain you more Consider

          A  --------   and B | | | |
             --------         | | | |
             --------         | | | | 
             --------         | | | |

Here, you will always consider, to multiply the first row of A with first column of B.And since you are using C I believe, CPU requires extra efforts to read in the all the columns of matrix B one by one inside the memory. To ease up these efforts, you can transpose the matrix and get the rows of matrix B' (which are nothing but columns of B essentially) and use loop tiling to cache the maximum amount of elements for multiplication.Hope this helps.

like image 85
Recker Avatar answered Jan 29 '23 21:01

Recker