
LAPACK/BLAS versus simple "for" loops

I want to migrate a piece of code that involves a number of vector and matrix calculations to C or C++, the objective being to speed up the code as much as possible.

Are linear algebra calculations with for loops in C code as fast as calculations using LAPACK/BLAS, or is there some unique speedup from using those libraries?

In other words, could simple C code (using for loops and the like) perform linear algebra calculations as fast as code that utilizes LAPACK/BLAS?

asked Feb 21 '11 by behzad.nouri


People also ask

What is the difference between BLAS and LAPACK?

BLAS (Basic Linear Algebra Subprograms) is a library of vector, vector-vector, matrix-vector and matrix-matrix operations. LAPACK is a library of dense and banded matrix linear algebra routines, such as solving linear systems and computing eigenvalue and singular value decompositions.

Does LAPACK use BLAS?

LAPACK relies on an underlying BLAS implementation to provide efficient and portable computational building blocks for its routines. LAPACK was designed as the successor to the linear equations and linear least-squares routines of LINPACK and the eigenvalue routines of EISPACK.
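As a rough sketch of that division of labor (assuming the LAPACKE and CBLAS C interfaces are installed; the example values are illustrative only):

/* LAPACK level: LAPACKE_dgesv solves A x = b via LU factorization,
   calling BLAS kernels internally. BLAS level: cblas_ddot is one such
   basic building block. Assumes <lapacke.h> and <cblas.h> are available. */
#include <lapacke.h>
#include <cblas.h>

void example(void)
{
    double A[4] = { 2.0, 1.0,
                    1.0, 3.0 };     /* 2x2 matrix, row-major */
    double b[2] = { 3.0, 5.0 };     /* right-hand side; overwritten with x */
    lapack_int ipiv[2];

    /* LAPACK-level routine: solve the linear system in place */
    LAPACKE_dgesv(LAPACK_ROW_MAJOR, 2, 1, A, 2, ipiv, b, 1);

    /* BLAS-level routine: a basic vector-vector operation */
    double x[2] = { 1.0, 2.0 }, y[2] = { 3.0, 4.0 };
    double d = cblas_ddot(2, x, 1, y, 1);
    (void)d;
}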

Is LAPACK parallel?

LAPACK is a set of Fortran subroutines covering a wide area of linear algebra algorithms. It was developed with the intention of being portable across a range of parallel processing environments.


1 Answer

Vendor-provided LAPACK/BLAS libraries (Intel's IPP/MKL have been mentioned, but there is also AMD's ACML, and other CPU vendors such as IBM (POWER) or Oracle (SPARC) provide equivalents as well) are often highly optimized for specific CPU capabilities, which significantly boosts performance on large data sets.

Often, though, you have very specific small data to operate on (say, 4x4 matrices or 4D dot products, i.e. operations used in 3D geometry processing), and for that sort of thing BLAS/LAPACK are overkill, because these subroutines first run tests to decide which code path to take depending on properties of the data set. In those situations, simple C/C++ source code, perhaps using SSE2 through SSE4 intrinsics and/or compiler-generated vectorization, may beat BLAS/LAPACK.
That's why, for example, Intel has two libraries: MKL for large linear algebra data sets, and IPP for small (graphics vector) data sets.
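For illustration, here is a minimal sketch of what such a hand-vectorized small-data routine might look like (assumes an x86 CPU with SSE3; the dot4 helper and surrounding code are a sketch, not any library's API):

/* Sketch: dot product of two 4D float vectors with SSE intrinsics.
   Assumes x86 with SSE3 for _mm_hadd_ps; dot4 is a hypothetical helper. */
#include <pmmintrin.h>

static inline float dot4(const float *a, const float *b)
{
    __m128 va = _mm_loadu_ps(a);    /* load 4 unaligned floats */
    __m128 vb = _mm_loadu_ps(b);
    __m128 p  = _mm_mul_ps(va, vb); /* elementwise products */
    p = _mm_hadd_ps(p, p);          /* pairwise horizontal adds... */
    p = _mm_hadd_ps(p, p);          /* ...twice gives the full sum */
    return _mm_cvtss_f32(p);        /* extract the lowest lane */
}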

In that sense,

  • What exactly is your data set?
  • What matrix/vector sizes?
  • What linear algebra operations?

Also, regarding "simple for loops": give the compiler a chance to vectorize for you. That is, something like:

/* assumes DIM_OF_MY_VECTOR is a multiple of 4 */
for (i = 0; i < DIM_OF_MY_VECTOR; i += 4) {
    vecmul[i] = src1[i] * src2[i];
    vecmul[i+1] = src1[i+1] * src2[i+1];
    vecmul[i+2] = src1[i+2] * src2[i+2];
    vecmul[i+3] = src1[i+3] * src2[i+3];
}
/* dotprod initialized to 0.0 beforehand */
for (i = 0; i < DIM_OF_MY_VECTOR; i += 4)
    dotprod += vecmul[i] + vecmul[i+1] + vecmul[i+2] + vecmul[i+3];

might be a better feed to a vectorizing compiler than the plain

for (i = 0; i < DIM_OF_MY_VECTOR; i++) dotprod += src1[i]*src2[i];

expression. So what exactly you mean by "calculations with for loops" can have a significant impact.
If your vector dimensions are large enough, though, the BLAS version,

dotprod = cblas_ddot(DIM_OF_MY_VECTOR, src1, 1, src2, 1);

will be cleaner code and likely faster.
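For completeness, a minimal self-contained version of that call (assumes a CBLAS implementation such as OpenBLAS or Netlib CBLAS is installed; link with, e.g., -lopenblas or -lcblas):

/* Minimal sketch: dot product via CBLAS. */
#include <stdio.h>
#include <cblas.h>

int main(void)
{
    double src1[] = { 1.0, 2.0, 3.0, 4.0 };
    double src2[] = { 5.0, 6.0, 7.0, 8.0 };
    /* 4 elements, stride 1 through both vectors */
    double dotprod = cblas_ddot(4, src1, 1, src2, 1);
    printf("dotprod = %f\n", dotprod);  /* prints 70.000000 */
    return 0;
}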

On the reference side, these might be of interest:

  • Intel Math Kernel Library (MKL) documentation (LAPACK/BLAS and others, optimized for Intel CPUs)
  • Intel Integrated Performance Primitives (IPP) documentation (optimized for small vectors / geometry processing)
  • AMD Core Math Library (ACML) documentation (LAPACK/BLAS and others, for AMD CPUs)
  • The Eigen library (a "nicer" C++ linear algebra interface)
answered Sep 18 '22 by FrankH.