I have implemented one matrix multiplication with boost::numeric::ublas::matrix (see my full, working Boost code):
    Result result = read();
    boost::numeric::ublas::matrix<int> C;
    C = boost::numeric::ublas::prod(result.A, result.B);
and another one with the standard algorithm (see full standard code):
    vector< vector<int> > ijkalgorithm(vector< vector<int> > A,
                                       vector< vector<int> > B) {
        int n = A.size();

        // initialise C with 0s
        vector<int> tmp(n, 0);
        vector< vector<int> > C(n, tmp);

        for (int i = 0; i < n; i++) {
            for (int k = 0; k < n; k++) {
                for (int j = 0; j < n; j++) {
                    C[i][j] += A[i][k] * B[k][j];
                }
            }
        }
        return C;
    }
This is how I test the speed:
    time boostImplementation.out > boostResult.txt
    diff boostResult.txt correctResult.txt

    time simpleImplementation.out > simpleResult.txt
    diff simpleResult.txt correctResult.txt
Both programs read a hard-coded text file which contains two 2000 x 2000 matrices. Both programs were compiled with these flags:
    g++ -std=c++98 -Wall -O3 -g $(PROBLEM).cpp -o $(PROBLEM).out -pedantic
I got 15 seconds for my implementation and over 4 minutes for the Boost implementation!
Edit: After compiling it with

    g++ -std=c++98 -Wall -pedantic -O3 -D NDEBUG -DBOOST_UBLAS_NDEBUG library-boost.cpp -o library-boost.out
I got 28.19 seconds for the ikj-algorithm and 60.99 seconds for Boost. So Boost is still considerably slower.
Why is Boost so much slower than my implementation?
The slower performance of the uBLAS version can be partly explained by its debugging features, as was pointed out by TJD.
Here's the time taken by the uBLAS version with debugging on:
    real    0m19.966s
    user    0m19.809s
    sys     0m0.112s
Here's the time taken by the uBLAS version with debugging off (-DNDEBUG -DBOOST_UBLAS_NDEBUG compiler flags added):
    real    0m7.061s
    user    0m6.936s
    sys     0m0.096s
So with debugging off, the uBLAS version is almost three times faster.
The remaining performance difference can be explained by quoting the following section of the uBLAS FAQ, "Why is uBLAS so much slower than (atlas-)BLAS?":

    An important design goal of uBLAS is to be as general as possible.

This generality almost always comes with a cost. In particular, the prod function template can handle different types of matrices, such as sparse or triangular ones. Fortunately, uBLAS provides alternatives optimized for dense matrix multiplication, in particular axpy_prod and block_prod. Here are the results of comparing the different methods (times in seconds):
    ijkalgorithm   prod    axpy_prod   block_prod
    1.335          7.061   1.330       1.278
As you can see, both axpy_prod and block_prod are somewhat faster than your implementation. Measuring just the computation time (without I/O), removing unnecessary copying, and carefully choosing the block size for block_prod (I used 64) can make the difference more pronounced.
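For concreteness, here is a minimal sketch of how these two routines can be called; the helper name multiply is purely illustrative, and A and B are assumed to be dense, size-compatible uBLAS matrices (in the Boost versions I have seen, axpy_prod is declared in <boost/numeric/ublas/operation.hpp> and block_prod in <boost/numeric/ublas/operation_blocked.hpp>):

    #include <boost/numeric/ublas/matrix.hpp>
    #include <boost/numeric/ublas/operation.hpp>         // axpy_prod
    #include <boost/numeric/ublas/operation_blocked.hpp> // block_prod

    using namespace boost::numeric::ublas;

    // Hypothetical helper: multiply two dense matrices using the
    // dense-optimized routines instead of the generic prod.
    matrix<int> multiply(const matrix<int>& A, const matrix<int>& B) {
        matrix<int> C(A.size1(), B.size2());
        axpy_prod(A, B, C, true); // 'true' zero-initializes C before accumulating
        return C;
        // Alternative with an explicit block size of 64:
        //   return block_prod<matrix<int>, 64>(A, B);
    }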
See also the uBLAS FAQ and "Effective uBLAS and general code optimization".
I believe your compiler doesn't optimize enough. uBLAS code makes heavy use of templates, and templates require heavy optimization. I ran your code through the MS VC 7.1 compiler in release mode for 1000 x 1000 matrices, and it gives me

    10.064 s for uBLAS
    7.851 s for vector
The difference is still there, but by no means overwhelming. uBLAS's core concept is lazy evaluation, so prod(A, B) evaluates its result only when needed; e.g. prod(A, B)(10, 100) will execute in no time, since only that one element is actually calculated. Accordingly, there is no dedicated algorithm for whole-matrix multiplication that could be optimized (see below).
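A tiny sketch to illustrate this (A and B are assumed to be filled, size-compatible uBLAS matrices; the function name is just for illustration):

    #include <boost/numeric/ublas/matrix.hpp>

    using namespace boost::numeric::ublas;

    // Illustrative only: prod(A, B) builds an expression object, so the
    // cost is paid when elements are read, not when the call is written.
    void lazy_demo(const matrix<int>& A, const matrix<int>& B) {
        int one = prod(A, B)(10, 100); // computes only this single element
        matrix<int> C = prod(A, B);    // assignment evaluates the whole product
        (void)one; (void)C;            // silence unused-variable warnings
    }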
You could help the library a little, though: declaring

    matrix<int, column_major> B;

will reduce the running time to 4.426 s, which beats your function with one hand tied behind its back. This declaration makes access to memory more sequential when multiplying the matrices, optimizing cache usage.
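In context, the change amounts to something like this (a sketch; the function name is illustrative):

    #include <boost/numeric/ublas/matrix.hpp>

    using namespace boost::numeric::ublas;

    // B is declared column_major so that prod's inner dot products walk
    // sequential memory in both operands: a row of A and a column of B
    // are then both contiguous.
    matrix<int> multiply_colmajor(const matrix<int>& A,
                                  const matrix<int, column_major>& B) {
        return prod(A, B);
    }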
P.S. Having read the uBLAS documentation to the end ;), you should have found out that there are actually dedicated functions to multiply whole matrices at once: axpy_prod and opb_prod. So

    opb_prod(A, B, C, true);

even on an unoptimized row_major B matrix, executes in 8.091 s and is on par with your vector algorithm.
P.P.S. There are even more optimizations:

    C = block_prod<matrix<int>, 1024>(A, B);

executes in 4.4 s, no matter whether B is column_major or row_major. Consider the description: "The function block_prod is designed for large dense matrices." Choose specific tools for specific tasks!