Improving the performance of Matrix Multiplication

This is my code for speeding up matrix multiplication, but it is only 5% faster than the simple version. What can I do to speed it up as much as possible?

*The matrices are accessed through an index helper, for example C[sub2ind(i,j,n)] for the C[i, j] position.

void matrixMultFast(float * const C,            /* output matrix */
                float const * const A,      /* first matrix */
                float const * const B,      /* second matrix */
                int const n,                /* number of rows/cols */
                int const ib,               /* size of i block */
                int const jb,               /* size of j block */
                int const kb)               /* size of k block */
{

int i=0, j=0, jj=0, k=0, kk=0;
float sum;

for(i=0;i<n;i++)
    for(j=0;j<n;j++)
        C[sub2ind(i,j,n)]=0;

for(kk=0;kk<n;kk+=kb)
{
    for(jj=0;jj<n;jj+=jb)
    {
        for(i=0;i<n;i++)
        {
            for(j=jj;j<jj+jb;j++)
            {
                sum=C[sub2ind(i,j,n)];
                for(k=kk;k<kk+kb;k++)
                    sum += A[sub2ind(i,k,n)]*B[sub2ind(k,j,n)];
                C[sub2ind(i,j,n)]=sum;
            }
        }
    }
}
} // end function 'matrixMultFast'

*It is written in C and needs to support C99.

asked Jun 05 '17 18:06

Kostas C.

1 Answer

There are many, many things you can do to improve the efficiency of matrix multiplication.

To examine how to improve the basic algorithm, let's first look at the available options. The naive implementation has 3 nested loops and a time complexity of O(n^3). Strassen's method lowers this to O(n^log2(7)), roughly O(n^2.81), but in practice it is rarely worthwhile: its larger constant factors, extra additions and temporary storage mean it only pays off for large matrices, and it is numerically less stable than the naive algorithm.
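As a baseline for everything that follows, here is a minimal sketch of the naive three-loop version. The question does not show how sub2ind is defined, so the row-major definition below is an assumption, and the name matrixMultNaive is made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed row-major index helper; the question uses sub2ind but does not
   show its definition. */
size_t sub2ind(size_t i, size_t j, size_t n)
{
    return i * n + j;
}

/* Naive O(n^3) product C = A * B for n x n matrices (hypothetical name). */
void matrixMultNaive(float *C, const float *A, const float *B, size_t n)
{
    assert(C != NULL && A != NULL && B != NULL);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            float sum = 0.0f;
            /* Dot product of row i of A with column j of B. */
            for (size_t k = 0; k < n; k++)
                sum += A[sub2ind(i, k, n)] * B[sub2ind(k, j, n)];
            C[sub2ind(i, j, n)] = sum;
        }
    }
}
```

Note that the innermost loop walks down a column of B, jumping n floats on every step. That access pattern is what the cache-related techniques try to avoid.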

That is the theory. Now consider how matrices are stored in memory. Row major is the standard in C, but you will find column major too (in Fortran, for example). Depending on the scheme, transposing one of your matrices might improve speed due to fewer cache misses. Matrix multiplication is, in essence, a collection of vector dot products. The same vector takes part in many of those dot products, so it makes sense to keep that vector in cache for faster access.
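One way to apply the transposition idea, sketched under the assumption that both matrices are stored row major: copy B into a transposed buffer once, so that every dot product reads two contiguous rows. The function name matrixMultTransposed and its return convention are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* C = A * B for n x n row-major matrices, using a transposed copy of B.
   Returns 0 on success, -1 if the temporary buffer cannot be allocated. */
int matrixMultTransposed(float *C, const float *A, const float *B, size_t n)
{
    assert(C != NULL && A != NULL && B != NULL);
    if (n == 0)
        return 0;

    float *Bt = malloc(n * n * sizeof *Bt);
    if (Bt == NULL)
        return -1;

    /* Bt[j][i] = B[i][j]: column j of B becomes row j of Bt. */
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            Bt[j * n + i] = B[i * n + j];

    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            float sum = 0.0f;
            /* Both A's row i and Bt's row j are read sequentially. */
            for (size_t k = 0; k < n; k++)
                sum += A[i * n + k] * Bt[j * n + k];
            C[i * n + j] = sum;
        }
    }

    free(Bt);
    return 0;
}
```

The transpose costs O(n^2) extra work and memory, which is small next to the O(n^3) multiplication. Whether it actually helps depends on the matrix size and the cache hierarchy, so measure before relying on it.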

Now, with multiple cores, parallelism and caches, the game changes. If we look a little closer, we see that a dot product is just a set of multiplications followed by a summation. These multiplications are independent of one another and can be done in parallel. Hence, we can also look at loading several numbers at once.
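A rough sketch of how that independence can be exploited, assuming row-major storage: each row of C depends only on one row of A, so the rows can be computed by different cores. The example uses an OpenMP pragma, which only takes effect when the compiler is invoked with OpenMP enabled (for example -fopenmp with GCC); otherwise the loop simply runs serially. The name matrixMultParallel is made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* C = A * B for n x n row-major matrices; rows of C are computed
   independently, so the outer loop can be split across threads. */
void matrixMultParallel(float *C, const float *A, const float *B, int n)
{
    assert(n >= 0);
    int i;
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (i = 0; i < n; i++) {
        float *crow = C + (size_t)i * n;
        for (int j = 0; j < n; j++)
            crow[j] = 0.0f;
        /* i-k-j order: the innermost loop walks B and C contiguously,
           which also gives the compiler a chance to vectorise it. */
        for (int k = 0; k < n; k++) {
            const float a = A[(size_t)i * n + k];
            const float *brow = B + (size_t)k * n;
            for (int j = 0; j < n; j++)
                crow[j] += a * brow[j];
        }
    }
}
```

No two threads ever write the same row of C, so no locking is needed.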

Now let's make things a little more complicated. For matrix multiplication it matters whether you use single-precision or double-precision floating point, because the two differ in size. A float is usually 32 bits and a double 64 bits (this depends on the platform, but it holds wherever IEEE 754 is used). Each CPU has only a fixed number of registers of fixed width, so the bigger your numbers, the fewer of them you can fit in those registers. The moral of the story: stick to single precision unless you really need double.
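The arithmetic behind that advice can be written down in a few lines. The helper name lanesPerRegister is made up for this illustration, and the register widths in the usage note (128 bits for SSE, 256 bits for AVX) are just common examples.

```c
#include <assert.h>
#include <stddef.h>

/* Number of elements of elem_bytes bytes that fit in one register that is
   register_bits wide (hypothetical helper for illustration). */
size_t lanesPerRegister(size_t register_bits, size_t elem_bytes)
{
    assert(elem_bytes > 0);
    return (register_bits / 8) / elem_bytes;
}
```

With a 4-byte float and an 8-byte double, lanesPerRegister(128, sizeof(float)) gives 4 and lanesPerRegister(128, sizeof(double)) gives 2, so a single vector instruction processes twice as many floats as doubles.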

Now that we've gone through the basics of how matrix multiplication can be tuned, worry not: you don't need to do any of the above yourself, since there are already libraries that do it. As mentioned in the comments, there's GotoBLAS, OpenBLAS, Intel's MKL, and Apple's Accelerate framework. MKL and Accelerate are proprietary, but OpenBLAS is a very competitive alternative.

Here's a nice little example that multiplies two 8192 x 8192 matrices with Accelerate on my Macintosh and prints the elapsed time in seconds:

#include <sys/time.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <Accelerate/Accelerate.h>

int SIZE = 8192;

typedef float point_t;

point_t* transpose(point_t* A) {    
    point_t* At = (point_t*) calloc(SIZE * SIZE, sizeof(point_t));    
    vDSP_mtrans(A, 1, At, 1, SIZE, SIZE);

    return At;
}

point_t* dot(point_t* A, point_t* B) {
    point_t* C = (point_t*) calloc(SIZE * SIZE, sizeof(point_t));       
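    /* Each of the four calls below multiplies one quarter of the rows of A:
       SIZE/4 rows of SIZE floats, i.e. step elements per quarter. */
    int step = (SIZE * SIZE / 4);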

    cblas_sgemm (CblasRowMajor, 
       CblasNoTrans, CblasNoTrans, SIZE/4, SIZE, SIZE,
       1.0, &A[0], SIZE, B, SIZE, 0.0, &C[0], SIZE);

    cblas_sgemm (CblasRowMajor, 
       CblasNoTrans, CblasNoTrans, SIZE/4, SIZE, SIZE,
       1.0, &A[step], SIZE, B, SIZE, 0.0, &C[step], SIZE);

    cblas_sgemm (CblasRowMajor, 
       CblasNoTrans, CblasNoTrans, SIZE/4, SIZE, SIZE,
       1.0, &A[step * 2], SIZE, B, SIZE, 0.0, &C[step * 2], SIZE);

    cblas_sgemm (CblasRowMajor, 
       CblasNoTrans, CblasNoTrans, SIZE/4, SIZE, SIZE,
       1.0, &A[step * 3], SIZE, B, SIZE, 0.0, &C[step * 3], SIZE);      

    return C;
}

void print(point_t* A) {
    int i, j;
    for(i = 0; i < SIZE; i++) {
        for(j = 0; j < SIZE; j++) {
            printf("%f  ", A[i * SIZE + j]);
        }
        printf("\n");
    }
}

int main() {
    for(; SIZE <= 8192; SIZE *= 2) {
        point_t* A = (point_t*) calloc(SIZE * SIZE, sizeof(point_t));
        point_t* B = (point_t*) calloc(SIZE * SIZE, sizeof(point_t));

        srand(getpid());

        int i;
        for(i = 0; i < SIZE * SIZE; i++) {
            A[i] = ((point_t)rand() / (double)RAND_MAX);
            B[i] = ((point_t)rand() / (double)RAND_MAX);
        }

        struct timeval t1, t2;
        double elapsed_time;

        gettimeofday(&t1, NULL);
        point_t* C = dot(A, B);
        gettimeofday(&t2, NULL);

        elapsed_time = (t2.tv_sec - t1.tv_sec) * 1000.0;      // sec to ms
        elapsed_time += (t2.tv_usec - t1.tv_usec) / 1000.0;   // us to ms

        printf("Time taken for %d x %d matrix multiplication: %f s\n", SIZE, SIZE, elapsed_time / 1000.0);

        free(A);
        free(B);
        free(C);

    }
    return 0;
}

At this point I should also mention SSE (Streaming SIMD Extensions), which you probably shouldn't attempt unless you are comfortable with low-level code such as assembly. With SSE you vectorise your C code by hand so that it works on vectors instead of scalars; that is, you operate on small blocks of data instead of single values. When you write intrinsics, the compiler largely translates your code as written rather than applying its own vectorisation. Done right, this can speed up your code considerably, but only by a constant factor: the algorithm is still O(n^3), and no implementation can go below O(n^2), since every entry of the result has to be written. It is also easy to misuse SSE, and most people unfortunately do, making the end result slower than before.
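To give a flavour of what this looks like, here is a minimal sketch of a dot product written with SSE intrinsics from xmmintrin.h. It assumes an x86 compiler that defines __SSE__ (as GCC and Clang do); elsewhere it falls back to the plain scalar loop. The function name dotSse is made up for this illustration, and unaligned loads are used so that the inputs need no special alignment.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Dot product of two float arrays of length n (hypothetical name). */
float dotSse(const float *x, const float *y, size_t n)
{
    assert(x != NULL && y != NULL);
    size_t i = 0;
    float sum = 0.0f;
#if defined(__SSE__)
    /* Four partial sums, one per 32-bit lane of a 128-bit register. */
    __m128 acc = _mm_setzero_ps();
    for (; i + 4 <= n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);   /* load 4 floats from x */
        __m128 vy = _mm_loadu_ps(y + i);   /* load 4 floats from y */
        acc = _mm_add_ps(acc, _mm_mul_ps(vx, vy));
    }
    /* Horizontal sum of the four lanes. */
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];
#endif
    /* Scalar tail (or the whole array when SSE is unavailable). */
    for (; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}
```

Because the partial sums are accumulated in a different order than in a scalar loop, the result can differ in the last bits for general floating-point input. A modern compiler at a high optimisation level may already generate similar code from the plain loop, so compare against that before keeping a hand-written version.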

I hope this motivates you to dig deeper. The world of matrix multiplication is a large and fascinating one. Below, I attach links for further reading.

OpenBLAS
More about SSE
Intel Intrinsics

answered Sep 22 '22 05:09

cs95

Tags: c, matrix, c99, matrix-multiplication
