This is my code for speeding up matrix multiplication, but it is only 5% faster than the naive version. What can I do to boost it as much as possible?
*The matrices are accessed, for example, as C[sub2ind(i,j,n)] for the C[i, j] position.
void matrixMultFast(float * const C,       /* output matrix */
                    float const * const A, /* first matrix */
                    float const * const B, /* second matrix */
                    int const n,           /* number of rows/cols */
                    int const ib,          /* size of i block (note: currently unused) */
                    int const jb,          /* size of j block */
                    int const kb)          /* size of k block */
{
    /* sub2ind(i,j,n) is the caller's index helper, presumably the
     * row-major offset i*n + j as described above. */
    int i = 0, j = 0, jj = 0, k = 0, kk = 0;
    float sum;

    /* zero the output matrix */
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            C[sub2ind(i,j,n)] = 0;

    /* blocked multiplication: kk and jj step over blocks of k and j;
     * the i dimension is not blocked */
    for (kk = 0; kk < n; kk += kb)
    {
        for (jj = 0; jj < n; jj += jb)
        {
            for (i = 0; i < n; i++)
            {
                for (j = jj; j < jj + jb; j++)
                {
                    sum = C[sub2ind(i,j,n)];
                    for (k = kk; k < kk + kb; k++)
                        sum += A[sub2ind(i,k,n)] * B[sub2ind(k,j,n)];
                    C[sub2ind(i,j,n)] = sum;
                }
            }
        }
    }
} // end function 'matrixMultFast'
*It is written in C and it needs to support C99
There are many, many things you can do to improve the efficiency of matrix multiplication.
To examine how to improve the basic algorithm, let's first take a look at our current options. The naive implementation has three nested loops and a time complexity of O(n^3). There is another method, Strassen's algorithm, which achieves an appreciable asymptotic speedup: it multiplies 2x2 blocks using 7 recursive multiplications instead of 8, giving the recurrence T(n) = 7T(n/2) + O(n^2) and hence O(n^log2(7)) ≈ O(n^2.807). In practice, though, it offers no appreciable speedup for realistically sized matrices: the extra additions, worse cache behaviour, and weaker numerical stability eat the asymptotic gain.
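For reference, this is the standard textbook 2x2 block formulation of Strassen (nothing specific to the code above):

M1 = (A11 + A22)(B11 + B22)
M2 = (A21 + A22) B11
M3 = A11 (B12 - B22)
M4 = A22 (B21 - B11)
M5 = (A11 + A12) B22
M6 = (A21 - A11)(B11 + B12)
M7 = (A12 - A22)(B21 + B22)

C11 = M1 + M4 - M5 + M7
C12 = M3 + M5
C21 = M2 + M4
C22 = M1 - M2 + M3 + M6

Seven multiplications of half-size blocks instead of eight is exactly where the log2(7) exponent comes from.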
That is the theory, though. Now consider how matrices are stored in memory. In C, row-major order is the standard, but you will find column-major too (Fortran, for example, uses it). Depending on the scheme, transposing one of your matrices might improve speed due to fewer cache misses. Matrix multiplication, after all, is just a bunch of vector dot products and additions: the same vector is operated upon by multiple other vectors, so it makes sense to keep that vector in cache for faster access, as the sketch below shows.
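As a minimal illustration of that idea (my own sketch, assuming plain row-major i*n + j indexing like the sub2ind described in the question), simply reordering the loops from i-j-k to i-k-j makes the innermost loop stream through rows of B and C contiguously:

/* i-k-j loop order: the innermost loop walks B and C row-wise,
 * so every access is to consecutive memory in row-major layout. */
void matmul_ikj(float *C, const float *A, const float *B, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            C[i*n + j] = 0.0f;

    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++) {
            float a = A[i*n + k];              /* reused across the j loop */
            for (int j = 0; j < n; j++)
                C[i*n + j] += a * B[k*n + j];  /* sequential accesses      */
        }
}

On typical hardware this alone is noticeably faster than the naive j-innermost order, and it combines well with the blocking you are already doing.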
Now, with the introduction of multiple cores, parallelism and the cache concept, the game changes. If we look a little closely, we see that a dot product is nothing but a bunch of multiplications followed by a summation. The multiplications are independent of one another, so they can be done in parallel, both across cores and within a core. Hence, we can now look at loading and processing several numbers at a time.
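For the multi-core part, here is a hedged sketch using OpenMP (my addition; the question doesn't mention it). Compile with -fopenmp on gcc/clang; without the flag the pragma is simply ignored and the code runs serially. Each thread owns whole rows of C, so no two threads ever write the same location:

/* Parallel i-k-j multiplication: rows of C are distributed across
 * threads, so the loop body needs no locks or atomics. */
void matmul_parallel(float *C, const float *A, const float *B, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++)
            C[i*n + j] = 0.0f;
        for (int k = 0; k < n; k++) {
            float a = A[i*n + k];
            for (int j = 0; j < n; j++)
                C[i*n + j] += a * B[k*n + j];
        }
    }
}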
Now let's make things a little more complicated. When talking about matrix multiplication, there is a distinction between single-precision and double-precision floating point in their size. Often the former is 32 bits and the latter 64 (of course, this depends on the platform). Each CPU only has a fixed number of registers, meaning the bigger your numbers, the fewer of them fit in the CPU at once: a vector register holds twice as many 32-bit floats as 64-bit doubles. Moral of the story: stick to single precision unless you really need double.
Now that we've gone through the basics of how matrix multiplication can be tuned, worry not: you need to do none of it yourself, because there are already subroutines that do it. As mentioned in the comments, there's GotoBLAS, OpenBLAS, Intel's MKL, and Apple's Accelerate framework. MKL and Accelerate are proprietary, but OpenBLAS is a very competitive alternative.
Here's a nice little example that multiplies two 8192 x 8192 matrices in a matter of seconds on my Macintosh:
#include <sys/time.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <Accelerate/Accelerate.h>

int SIZE = 8192;
typedef float point_t;

/* Transpose via Accelerate's vDSP; not called below, but handy. */
point_t* transpose(point_t* A) {
    point_t* At = (point_t*) calloc(SIZE * SIZE, sizeof(point_t));
    vDSP_mtrans(A, 1, At, 1, SIZE, SIZE);
    return At;
}

point_t* dot(point_t* A, point_t* B) {
    point_t* C = (point_t*) calloc(SIZE * SIZE, sizeof(point_t));
    /* split the product into four row-blocks of A; each sgemm call
     * computes SIZE/4 rows of C */
    int step = (SIZE * SIZE / 4);
    cblas_sgemm(CblasRowMajor,
                CblasNoTrans, CblasNoTrans, SIZE/4, SIZE, SIZE,
                1.0, &A[0], SIZE, B, SIZE, 0.0, &C[0], SIZE);
    cblas_sgemm(CblasRowMajor,
                CblasNoTrans, CblasNoTrans, SIZE/4, SIZE, SIZE,
                1.0, &A[step], SIZE, B, SIZE, 0.0, &C[step], SIZE);
    cblas_sgemm(CblasRowMajor,
                CblasNoTrans, CblasNoTrans, SIZE/4, SIZE, SIZE,
                1.0, &A[step * 2], SIZE, B, SIZE, 0.0, &C[step * 2], SIZE);
    cblas_sgemm(CblasRowMajor,
                CblasNoTrans, CblasNoTrans, SIZE/4, SIZE, SIZE,
                1.0, &A[step * 3], SIZE, B, SIZE, 0.0, &C[step * 3], SIZE);
    return C;
}

void print(point_t* A) {
    int i, j;
    for (i = 0; i < SIZE; i++) {
        for (j = 0; j < SIZE; j++) {
            printf("%f ", A[i * SIZE + j]);
        }
        printf("\n");
    }
}

int main() {
    for (; SIZE <= 8192; SIZE *= 2) {
        point_t* A = (point_t*) calloc(SIZE * SIZE, sizeof(point_t));
        point_t* B = (point_t*) calloc(SIZE * SIZE, sizeof(point_t));
        srand(getpid());

        int i;
        for (i = 0; i < SIZE * SIZE; i++) {
            A[i] = ((point_t)rand() / (double)RAND_MAX);
            B[i] = ((point_t)rand() / (double)RAND_MAX);
        }

        struct timeval t1, t2;
        double elapsed_time;
        gettimeofday(&t1, NULL);
        point_t* C = dot(A, B);
        gettimeofday(&t2, NULL);

        elapsed_time  = (t2.tv_sec - t1.tv_sec) * 1000.0;   // sec to ms
        elapsed_time += (t2.tv_usec - t1.tv_usec) / 1000.0; // us to ms
        printf("Time taken for %d size matrix multiplication: %lf s\n",
               SIZE, elapsed_time / 1000.0);

        free(A);
        free(B);
        free(C);
    }
    return 0;
}
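If you want to try this yourself, on macOS it should build with something along the lines of clang -O2 prog.c -framework Accelerate (my assumption about the toolchain; adjust as needed).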
At this point I should also mention SSE (Streaming SIMD Extensions), which is basically something you shouldn't hand-write unless you've worked with assembly. Basically, you're vectorising your C code to work with short vectors instead of single scalars, which means you operate on blocks of data instead of individual values. When you write intrinsics by hand, the compiler largely takes you at your word rather than applying its own vectorisation scheme. Done right, it can speed up your code like nothing else, but be clear about what it buys you: it is a constant-factor improvement (4x or 8x per instruction), not a change in the O(n^3) arithmetic, and it certainly does not reach the theoretical floor of O(n^2) work. It is also easy to abuse, and many people unfortunately do, making the end result slower than before.
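To make that concrete, here is a minimal, hedged sketch of what hand-vectorised code looks like (my own illustration, not a tuned kernel; the name dot_sse is mine): the dot product of a row of A with a row of a pre-transposed B, four floats at a time. It assumes n is a multiple of 4 and 16-byte-aligned inputs, and needs an x86 compiler that ships xmmintrin.h:

#include <xmmintrin.h>  /* SSE intrinsics (x86) */

/* Dot product of two contiguous float arrays, 4 lanes at a time.
 * 'a' is a row of A, 'b' is a row of B-transposed; both are assumed
 * 16-byte aligned, with n a multiple of 4. */
static float dot_sse(const float *a, const float *b, int n)
{
    __m128 acc = _mm_setzero_ps();
    for (int k = 0; k < n; k += 4) {
        __m128 va = _mm_load_ps(a + k);            /* load 4 floats of a */
        __m128 vb = _mm_load_ps(b + k);            /* load 4 floats of b */
        acc = _mm_add_ps(acc, _mm_mul_ps(va, vb)); /* 4 running partials */
    }
    float lanes[4];
    _mm_storeu_ps(lanes, acc);                     /* horizontal reduction */
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}

Note how even this toy version imposes alignment and size restrictions on its inputs; that is exactly the kind of detail that makes hand-written SSE easy to get wrong.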
I hope this motivates you to dig deeper; the world of matrix multiplication is a large and fascinating one.