
Speed Up Matrix Multiplication with OpenMP and Block Method: Can I Do Better?

This is the code I wrote:

#include <omp.h>

void matrix_multi(int c[][TSIZE], int a[][TSIZE], int b[][TSIZE])
{
    const int B = 8;              /* block size; TSIZE assumed a multiple of B */
    int i, j, k, i1, j1, k1;

/* Note: collapse(2), not collapse(3). Collapsing the k loop as well would let
   two threads with the same (i, j) but different k blocks update the same
   c[i1][j1] entries concurrently, which is a data race. */
#pragma omp parallel for private(i, j, k, i1, j1, k1) schedule(auto) collapse(2)
    for (i = 0; i < TSIZE; i += B)
        for (j = 0; j < TSIZE; j += B)
            for (k = 0; k < TSIZE; k += B)
                /* multiply one BxB block of a by one BxB block of b */
                for (i1 = i; i1 < i + B; i1++)
                    for (j1 = j; j1 < j + B; j1++)
                    {
                        int sum = 0;
                        for (k1 = k; k1 < k + B; k1++)
                            sum += a[i1][k1] * b[k1][j1];
                        c[i1][j1] += sum;
                    }
}

My question is: can I get better performance with further manipulation of the three inner loops?

Huanming Song asked May 18 '16


People also ask

Which algorithm is faster for matrix multiplication?

In linear algebra, the Strassen algorithm, named after Volker Strassen, is an algorithm for matrix multiplication. It is faster than the standard matrix multiplication algorithm for large matrices, with a better asymptotic complexity, although the naive algorithm is often better for smaller matrices.
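The asymptotic gap can be made precise. Strassen's trick computes a $2 \times 2$ block product with 7 multiplications instead of 8, giving the recurrence and bound

$$T(n) = 7\,T(n/2) + O(n^2) \quad\Longrightarrow\quad T(n) = O\!\left(n^{\log_2 7}\right) \approx O\!\left(n^{2.81}\right),$$

compared with $O(n^3)$ for the standard algorithm. The hidden constant is larger, which is why the naive algorithm often wins on small matrices.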

How do you optimize a matrix multiplication?

Once a block version of the matrix-matrix multiplication is implemented, one typically optimizes the algorithm further by unrolling the innermost loop (i.e., instead of using a for loop to do 8 updates, one writes the 8 updates directly in the program) to help the compiler pipeline the instructions to the CPU.

Does it matter which way you multiply a matrix?

Matrix multiplication is not commutative: in other words, in matrix multiplication, the order in which two matrices are multiplied matters!

Can you multiply block matrices?

If matrices A and B are partitioned compatibly into blocks, the product AB can be computed by matrix multiplication using the blocks as entries.
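Concretely, for a compatible $2 \times 2$ partition the block product has exactly the shape of an ordinary $2 \times 2$ product:

$$\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}\begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}=\begin{pmatrix} A_{11}B_{11}+A_{12}B_{21} & A_{11}B_{12}+A_{12}B_{22} \\ A_{21}B_{11}+A_{22}B_{21} & A_{21}B_{12}+A_{22}B_{22} \end{pmatrix}.$$

This identity is exactly what justifies the blocked loop structure in the question's code.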

How fast is matrix multiplication?

As matrices grow larger, the number of multiplications needed to find their product increases much faster than the number of additions. While it takes eight intermediate multiplications to find the product of two-by-two matrices, it takes 64 to find the product of four-by-four matrices.


2 Answers

Linear algebra is one of the most common operations computers perform. In games and graphics libraries it is THE most common operation. It has been studied and optimized heavily, with entire research groups dedicated to it.

If you care about speed, you should be performing matrix multiplication with a BLAS library. Some of the things that a BLAS library will optimize for:

  • minimize cache-misses by performing the matrix multiplication in blocks rather than looping over the entire matrix
  • optimize the block size for the cache-size of the computer
  • if the computer/CPU has multiple cache levels, use multiple optimized block size levels
  • use SIMD instructions if available on the CPU

Note that parallelization is not on the list. This is because on today's computers memory access is much slower than computation, so a memory-bound kernel gains little from extra threads. You may even see worse performance with OpenMP due to the overhead of thread management and context switching.

zyamys answered Oct 06 '22


It seems that you are still far from fully optimized. Have you tried loop unrolling, loop interchange, etc.?

You could refer to the following link for a step-by-step optimization of matrix multiplication:

http://wiki.cs.utexas.edu/rvdg/HowToOptimizeGemm/

kangshiyin answered Oct 05 '22