This is the code I wrote:
#include <omp.h>

/* Blocked (tiled) multiplication of TSIZE x TSIZE int matrices.
   Assumes TSIZE is defined elsewhere, is divisible by the block size B,
   and that c is zero-initialized. */
void matrix_multi(int c[][TSIZE], int a[][TSIZE], int b[][TSIZE])
{
    int B = 8;
    int i, j, k, i1, j1, k1;
    /* collapse(2), not collapse(3): collapsing the k loop as well would let
       two threads update the same c[i1][j1] from different k blocks, which
       is a data race. */
    #pragma omp parallel for private(i, j, k, i1, j1, k1) schedule(auto) collapse(2)
    for (i = 0; i < TSIZE; i += B)
        for (j = 0; j < TSIZE; j += B)
            for (k = 0; k < TSIZE; k += B)
                for (i1 = i; i1 < i + B; i1++)
                    for (j1 = j; j1 < j + B; j1++)
                    {
                        int sum = 0;
                        for (k1 = k; k1 < k + B; k1++)
                        {
                            sum += a[i1][k1] * b[k1][j1];
                        }
                        c[i1][j1] += sum;
                    }
}
My question is: can I get better performance with some further manipulation of the three inner loops?
In linear algebra, the Strassen algorithm, named after Volker Strassen, is an algorithm for matrix multiplication. It is faster than the standard algorithm for large matrices, with a better asymptotic complexity (O(n^2.807) rather than O(n^3)), although the naive algorithm is often faster for small matrices.
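For reference, here is a minimal sketch of Strassen's scheme at the 2x2 level (the function name strassen2x2 is just for illustration). Seven multiplications m1..m7 replace the usual eight, and applying the same scheme recursively to matrix blocks is what yields the O(n^2.807) bound:

/* Strassen's seven-product scheme for a single 2x2 multiplication.
   Recursing on matrix blocks instead of scalars gives the full algorithm. */
void strassen2x2(const int a[2][2], const int b[2][2], int c[2][2])
{
    int m1 = (a[0][0] + a[1][1]) * (b[0][0] + b[1][1]);
    int m2 = (a[1][0] + a[1][1]) * b[0][0];
    int m3 = a[0][0] * (b[0][1] - b[1][1]);
    int m4 = a[1][1] * (b[1][0] - b[0][0]);
    int m5 = (a[0][0] + a[0][1]) * b[1][1];
    int m6 = (a[1][0] - a[0][0]) * (b[0][0] + b[0][1]);
    int m7 = (a[0][1] - a[1][1]) * (b[1][0] + b[1][1]);
    c[0][0] = m1 + m4 - m5 + m7;
    c[0][1] = m3 + m5;
    c[1][0] = m2 + m4;
    c[1][1] = m1 - m2 + m3 + m6;
}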
Once a block version of the matrix-matrix multiplication is implemented, one typically optimizes further by unrolling the innermost loop (i.e., instead of using a for loop to do 8 updates, writing the 8 updates directly in the program) to help the compiler pipeline the instructions on the CPU; see the sketch below.
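Here is a minimal sketch of what that could look like for the code above, assuming the same TSIZE arrays and the fixed block size B == 8; block_update is a hypothetical helper that the three blocked outer loops would call:

/* One 8x8 block update with the innermost k1 loop written out as eight
   explicit multiply-accumulates. The eight products are independent,
   which gives the compiler room to pipeline them. */
static void block_update(int c[][TSIZE], int a[][TSIZE], int b[][TSIZE],
                         int i, int j, int k)
{
    int i1, j1;
    for (i1 = i; i1 < i + 8; i1++)
        for (j1 = j; j1 < j + 8; j1++)
        {
            int sum = a[i1][k]     * b[k][j1]
                    + a[i1][k + 1] * b[k + 1][j1]
                    + a[i1][k + 2] * b[k + 2][j1]
                    + a[i1][k + 3] * b[k + 3][j1]
                    + a[i1][k + 4] * b[k + 4][j1]
                    + a[i1][k + 5] * b[k + 5][j1]
                    + a[i1][k + 6] * b[k + 6][j1]
                    + a[i1][k + 7] * b[k + 7][j1];
            c[i1][j1] += sum;
        }
}

In practice, measure before and after: modern compilers often perform this kind of unrolling themselves at -O2/-O3.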
Matrix multiplication is not commutative. In other words, the order in which two matrices are multiplied matters! For example, with A = [[1, 1], [0, 1]] and B = [[1, 0], [1, 1]], AB = [[2, 1], [1, 1]] but BA = [[1, 1], [1, 2]].
If matrices A and B are partitioned compatibly into blocks, the product C = AB can be computed by matrix multiplication using the blocks as entries: each block C_ij is the sum over k of A_ik * B_kj, which is exactly what the blocked code above exploits.
As matrices grow larger, the number of multiplications needed to find their product (n^3 for the standard algorithm on n-by-n matrices) grows faster than the number of additions (n^3 - n^2). While it takes eight intermediate multiplications to find the product of two 2-by-2 matrices, it takes 64 to find the product of two 4-by-4 matrices.
Linear algebra is one of the most common workloads computers perform. In games and graphics libraries it is THE most common operation. It has been studied and optimized heavily, with entire research groups dedicated to it.
If you care about speed, you should be performing matrix multiplication with a BLAS library. Some of the things that a BLAS library will optimize for:

- cache-friendly memory access patterns (blocking tuned to each cache level)
- SIMD vector instructions
- instruction pipelining, e.g. through loop unrolling

Note that parallelization is not on the list. This is because in today's computers memory access is slower than the CPU, so a single core can already keep the memory bus busy. You will see worse performance with OpenMP due to the overhead of context switching.
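As an illustration, a call into BLAS could look like the following minimal sketch, assuming a CBLAS implementation such as OpenBLAS is installed, and switching to double since BLAS has no integer GEMM (the TSIZE value and the function name matrix_multi_blas are assumptions for this example):

#include <cblas.h>   /* assumes a CBLAS header, e.g. from OpenBLAS */

#define TSIZE 512    /* assumed value; the question defines TSIZE elsewhere */

/* Computes c = a * b for row-major TSIZE x TSIZE matrices of doubles:
   dgemm performs c = alpha * a * b + beta * c. */
void matrix_multi_blas(const double *a, const double *b, double *c)
{
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                TSIZE, TSIZE, TSIZE,
                1.0, a, TSIZE, b, TSIZE,
                0.0, c, TSIZE);
}

Link against the library when compiling, e.g. gcc -O2 matmul.c -lopenblas.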
It seems that you are still far from fully optimized. Have you tried loop unrolling, loop inversion, etc.?
You could refer to the following link for a step-by-step optimization of matrix multiplication:
http://wiki.cs.utexas.edu/rvdg/HowToOptimizeGemm/