
Laderman's 3x3 matrix multiplication with only 23 multiplications, is it worth it?

Take the product of two 3x3 matrices, A*B=C. Naively this requires 27 multiplications using the standard algorithm. If one is clever, it can be done using only 23 multiplications, a result found in 1973 by Laderman. The technique involves saving intermediate products and combining them in the right way.
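
For concreteness, here is a minimal sketch of the standard triple-loop version being compared against (the function name and array layout are just illustrative choices); Laderman's method instead forms 23 products of sums of entries of A and B, saves them, and combines them into the nine entries of C:

void mul3x3_naive(const double A[3][3], const double B[3][3], double C[3][3])
{
    // Standard algorithm: 27 multiplications and 18 additions.
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j) {
            double sum = 0.0;
            for (int k = 0; k < 3; ++k)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }
}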

Now let's fix a language and a type, say C++ with elements of double. If the Laderman algorithm were hard-coded versus the simple double loop, could we expect a modern compiler's optimizations to erase the difference between the algorithms?

Notes about this question: This is a programming site, and the question is asked in the context of best practice for a time-critical inner loop; premature optimization this is not. Tips on implementation are greatly welcomed as comments.

asked May 31 '12 by Hooked


1 Answer

The key is mastering the instruction set of your platform, and what works best depends on that platform. There are several techniques, and when you need the maximum possible performance, your compiler will come with profiling tools, some of which have optimization hinting built in. For the finest-grained operations, look at the assembler output and see whether there are any improvements to be made at that level as well.

Single instruction, multiple data (SIMD) instructions perform the same operation on several operands in parallel. That means you can take

double a, b, c, d;
// four independent scalar additions
double w = d + a;
double x = a + b;
double y = b + c;
double z = c + d;

and replace it with

double256 dabc = pack256(d, a, b, c);  // fictional 4-wide double type
double256 abcd = pack256(a, b, c, d);  // and a fictional pack intrinsic
double256 wxyz = dabc + abcd;          // one SIMD add yields w, x, y, z

The idea is that when the values are loaded, they are loaded into a single 256-bit-wide register, on some fictional platform with 256-bit-wide registers.
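
On real x86 hardware this corresponds to AVX, where __m256d holds four doubles. A minimal sketch using the actual intrinsics (the function name and output array are illustrative; note that _mm256_set_pd lists lanes from highest to lowest):

#include <immintrin.h>

// Computes w = d+a, x = a+b, y = b+c, z = c+d with one vector add.
// Build with -mavx (GCC/Clang) or /arch:AVX (MSVC).
void add4(double a, double b, double c, double d, double out[4])
{
    __m256d dabc = _mm256_set_pd(c, b, a, d);  // lanes 0..3 = d, a, b, c
    __m256d abcd = _mm256_set_pd(d, c, b, a);  // lanes 0..3 = a, b, c, d
    __m256d wxyz = _mm256_add_pd(dabc, abcd);  // one instruction, four sums
    _mm256_storeu_pd(out, wxyz);               // out = {w, x, y, z}
}

In practice, packing scalars into a vector register has its own cost, so SIMD tends to pay off when the data already sits in memory in a vector-friendly layout.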

Floating point is an important consideration: some DSPs can be significantly faster when operating on integers. GPUs tend to be great at floating point, although some are 2x faster at single precision than at double. The 3x3 case of this problem could fit into a single CUDA thread, so you could stream 256 of these calculations simultaneously.

Pick your platform, read the documentation, implement several different methods and profile them.
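
As a starting point for that, here is a minimal timing-harness sketch using std::chrono (the iteration count and checksum trick are illustrative; swap in a hard-coded Laderman routine to compare the two):

#include <chrono>
#include <cstdio>

// Standard triple-loop multiply from the sketch above.
static void mul3x3_naive(const double A[3][3], const double B[3][3], double C[3][3])
{
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j) {
            double s = 0.0;
            for (int k = 0; k < 3; ++k)
                s += A[i][k] * B[k][j];
            C[i][j] = s;
        }
}

int main()
{
    double A[3][3] = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}};
    double B[3][3] = {{9, 8, 7}, {6, 5, 4}, {3, 2, 1}};
    double C[3][3];

    const int iters = 10000000;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) {
        mul3x3_naive(A, B, C);
        A[0][0] = C[0][0] * 1e-9;  // feed a result back so the loop is not optimized away
    }
    auto t1 = std::chrono::steady_clock::now();

    std::chrono::duration<double, std::nano> dt = t1 - t0;
    std::printf("%.2f ns per multiply (checksum %g)\n", dt.count() / iters, C[0][0]);
}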

answered Oct 03 '22 by totowtwo