
Optimized matrix multiplication in C

Tags: c, matrix

I'm trying to compare different methods for matrix multiplication. The first one is the normal method:

do {
    for (j = 0; j < i; j++)
    {
        for (k = 0; k < i; k++)
        {
            suma = 0;
            for (l = 0; l < i; l++)
                suma += MatrixA[j][l] * MatrixB[l][k];
            MatrixR[j][k] = suma;
        }
    }
    c++;
} while (c < iteraciones);

The second one consists of transposing matrix B first and then doing the multiplication by rows:

int f, co;
double tmp; /* assuming the matrices hold doubles */

/* transpose B in place; swap only above the diagonal so that
   elements are not overwritten before they are read */
for (f = 0; f < i; f++) {
    for (co = f + 1; co < i; co++) {
        tmp = MatrixB[f][co];
        MatrixB[f][co] = MatrixB[co][f];
        MatrixB[co][f] = tmp;
    }
}

c = 0;
do {
    for (j = 0; j < i; j++)
    {
        for (k = 0; k < i; k++)
        {
            suma = 0;
            for (l = 0; l < i; l++)
                suma += MatrixA[j][l] * MatrixB[k][l];
            MatrixR[j][k] = suma;
        }
    }
    c++;
} while (c < iteraciones);

The second method is supposed to be much faster because it accesses contiguous memory slots, but I'm not getting a significant performance improvement. Am I doing something wrong?

I can post the complete code, but I don't think it's needed.

asked Dec 15 '09 by Peter



2 Answers

What Every Programmer Should Know About Memory (pdf link) by Ulrich Drepper has a lot of good ideas about memory efficiency, and in particular he uses matrix multiplication as an example of how knowing about memory, and using that knowledge, can speed up this process. Look at appendix A.1 in his paper and read through section 6.2.1. Table 6.2 in the paper shows that he could get the running time down to about 10% of a naive implementation's time for a 1000x1000 matrix.

Granted, his final code is pretty hairy and uses a lot of system-specific stuff and compile-time tuning, but still, if you really need speed, reading the paper and his implementation is definitely worth it.

answered by Alok Singhal


Getting this right is non-trivial. Using an existing BLAS library is highly recommended.
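For example, with any CBLAS implementation (OpenBLAS, ATLAS, Intel MKL, Apple Accelerate), a double-precision multiply of two n x n matrices is a single dgemm call. A minimal sketch, assuming the matrices are stored as contiguous row-major arrays of doubles; the wrapper name and the link flag are illustrative:

#include <cblas.h>

/* Computes C = 1.0 * A * B + 0.0 * C for n x n row-major matrices
   stored as contiguous arrays. Link against your BLAS, e.g. -lopenblas. */
void multiply(const double *A, const double *B, double *C, int n)
{
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n,       /* M, N, K */
                1.0, A, n,     /* alpha, A, lda */
                B, n,          /* B, ldb */
                0.0, C, n);    /* beta, C, ldc */
}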

Should you really be inclined to roll your own matrix multiplication, loop tiling is an optimization that is of particular importance for large matrices. The tiling should be tuned to the cache size to ensure that the cache is not being continually thrashed, which will occur with a naive implementation. I once measured a 12x performance difference tiling a matrix multiply with matrix sizes picked to consume multiples of my cache (circa '97 so the cache was probably small).

Loop tiling algorithms assume that a contiguous linear array of elements is used, as opposed to rows or columns of pointers. With such a storage choice, your indexing scheme determines which dimension changes fastest, and you are free to decide whether row or column access will have the best cache performance.
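As a rough illustration, here is a minimal tiled multiply over contiguous row-major arrays. The block size BS is an assumption to be tuned to your cache, and the sketch omits the register blocking and SIMD a serious implementation would add; for brevity it also assumes n is a multiple of BS:

#define BS 64  /* tile edge; tune so a few BS x BS tiles fit in cache */

/* C += A * B for n x n row-major matrices in contiguous arrays.
   Assumes C is zero-initialized and n is a multiple of BS. */
void multiply_tiled(const double *A, const double *B, double *C, int n)
{
    for (int ii = 0; ii < n; ii += BS)
        for (int kk = 0; kk < n; kk += BS)
            for (int jj = 0; jj < n; jj += BS)
                /* multiply one BS x BS tile of A by one tile of B */
                for (int i = ii; i < ii + BS; i++)
                    for (int k = kk; k < kk + BS; k++) {
                        double a = A[i*n + k];
                        for (int j = jj; j < jj + BS; j++)
                            C[i*n + j] += a * B[k*n + j];
                    }
}

Because each tile of A, B, and C is reused many times while it is resident in cache, the memory traffic per arithmetic operation drops sharply compared to the naive triple loop.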

There's a lot of literature on the subject. The following references, especially the Banerjee books, may be helpful:

[Ban93] Banerjee, Utpal, Loop Transformations for Restructuring Compilers: The Foundations, Kluwer Academic Publishers, Norwell, MA, 1993.

[Ban94] Banerjee, Utpal, Loop Parallelization, Kluwer Academic Publishers, Norwell, MA, 1994.

[BGS93] Bacon, David F., Susan L. Graham, and Oliver Sharp, Compiler Transformations for High-Performance Computing, Computer Science Division, University of California, Berkeley, Calif., Technical Report No. UCB/CSD-93-781.

[LRW91] Lam, Monica S., Edward E. Rothberg, and Michael E. Wolf, The Cache Performance and Optimizations of Blocked Algorithms, in 4th International Conference on Architectural Support for Programming Languages and Operating Systems, Santa Clara, Calif., April 1991, pp. 63-74.

[LW91] Lam, Monica S., and Michael E. Wolf, A Loop Transformation Theory and an Algorithm to Maximize Parallelism, IEEE Transactions on Parallel and Distributed Systems, 1991, 2(4):452-471.

[PW86] Padua, David A., and Michael J. Wolfe, Advanced Compiler Optimizations for Supercomputers, Communications of the ACM, 29(12):1184-1201, 1986.

[Wolfe89] Wolfe, Michael J., Optimizing Supercompilers for Supercomputers, The MIT Press, Cambridge, MA, 1989.

[Wolfe96] Wolfe, Michael J., High Performance Compilers for Parallel Computing, Addison-Wesley, CA, 1996.

answered by Peeter Joot