
Laderman's 3x3 matrix multiplication with only 23 multiplications, is it worth it?

Take the product of two 3x3 matrices, A*B=C. Naively this requires 27 multiplications using the standard algorithm. If one is clever, it can be done using only 23 multiplications, a result found in 1973 by Laderman. The technique involves saving intermediate products and combining them in the right way.
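
For concreteness, here is a minimal sketch of the standard triple-loop version being compared against (the function name and array layout are just illustrative choices); Laderman's method instead forms 23 products of sums of entries of A and B, saves them, and combines them into the nine entries of C:

void mul3x3_naive(const double A[3][3], const double B[3][3], double C[3][3])
{
    // Standard algorithm: 27 multiplications and 18 additions.
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j) {
            double sum = 0.0;
            for (int k = 0; k < 3; ++k)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }
}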

Now let's fix a language and a type, say C++ with elements of double. If the Laderman algorithm were hard-coded versus the simple double loop, could we expect a modern compiler's optimizations to erase the difference between the algorithms?

Notes about this question: This is a programming site, and the question is asked in the context of best practice for a time-critical inner loop; premature optimization this is not. Tips on implementation are greatly welcomed as comments.

asked May 31 '12 by Hooked


1 Answer

The key is mastering the instruction set of your platform, and what works best depends on that platform. There are several techniques, and when you need the maximum possible performance, your compiler will come with profiling tools, some of which have optimization hinting built in. For the finest-grained operations, look at the assembler output and see whether there are any improvements to be made at that level as well.

Single instruction, multiple data (SIMD) instructions perform the same operation on several operands in parallel. That means you can take

double a, b, c, d;
// four independent scalar additions
double w = d + a;
double x = a + b;
double y = b + c;
double z = c + d;

and replace it with

double256 dabc = pack256(d, a, b, c);  // fictional 4-wide double type
double256 abcd = pack256(a, b, c, d);  // and a fictional pack intrinsic
double256 wxyz = dabc + abcd;          // one SIMD add yields w, x, y, z

The idea is that when the values are loaded, they are loaded into a single 256-bit-wide register, on some fictional platform with 256-bit-wide registers.
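
On real x86 hardware this corresponds to AVX, where __m256d holds four doubles. A minimal sketch using the actual intrinsics (the function name and output array are illustrative; note that _mm256_set_pd lists lanes from highest to lowest):

#include <immintrin.h>

// Computes w = d+a, x = a+b, y = b+c, z = c+d with one vector add.
// Build with -mavx (GCC/Clang) or /arch:AVX (MSVC).
void add4(double a, double b, double c, double d, double out[4])
{
    __m256d dabc = _mm256_set_pd(c, b, a, d);  // lanes 0..3 = d, a, b, c
    __m256d abcd = _mm256_set_pd(d, c, b, a);  // lanes 0..3 = a, b, c, d
    __m256d wxyz = _mm256_add_pd(dabc, abcd);  // one instruction, four sums
    _mm256_storeu_pd(out, wxyz);               // out = {w, x, y, z}
}

In practice, packing scalars into a vector register has its own cost, so SIMD tends to pay off when the data already sits in memory in a vector-friendly layout.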

Floating point is an important consideration: some DSPs can be significantly faster when operating on integers. GPUs tend to be great at floating point, although some are 2x faster at single precision than at double. The 3x3 case of this problem could fit into a single CUDA thread, so you could stream 256 of these calculations simultaneously.

Pick your platform, read the documentation, implement several different methods and profile them.
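
As a starting point for that, here is a minimal timing-harness sketch using std::chrono (the iteration count and checksum trick are illustrative; swap in a hard-coded Laderman routine to compare the two):

#include <chrono>
#include <cstdio>

// Standard triple-loop multiply from the sketch above.
static void mul3x3_naive(const double A[3][3], const double B[3][3], double C[3][3])
{
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j) {
            double s = 0.0;
            for (int k = 0; k < 3; ++k)
                s += A[i][k] * B[k][j];
            C[i][j] = s;
        }
}

int main()
{
    double A[3][3] = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}};
    double B[3][3] = {{9, 8, 7}, {6, 5, 4}, {3, 2, 1}};
    double C[3][3];

    const int iters = 10000000;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) {
        mul3x3_naive(A, B, C);
        A[0][0] = C[0][0] * 1e-9;  // feed a result back so the loop is not optimized away
    }
    auto t1 = std::chrono::steady_clock::now();

    std::chrono::duration<double, std::nano> dt = t1 - t0;
    std::printf("%.2f ns per multiply (checksum %g)\n", dt.count() / iters, C[0][0]);
}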

answered Oct 03 '22 by totowtwo