I want to migrate a piece of code that involves a number of vector and matrix calculations to C or C++, the objective being to speed up the code as much as possible.
Are linear algebra calculations with for loops in C code as fast as calculations using LAPACK/BLAS, or is there some unique speedup from using those libraries?
In other words, could simple C code (using for loops and the like) perform linear algebra calculations as fast as code that utilizes LAPACK/BLAS?
BLAS (Basic Linear Algebra Subprograms) is a library of vector, vector-vector, matrix-vector, and matrix-matrix operations. LAPACK is a library of dense and banded matrix linear algebra routines, covering tasks such as solving linear systems and computing eigenvalue and singular value decompositions.
LAPACK relies on an underlying BLAS implementation to provide efficient and portable computational building blocks for its routines. LAPACK was designed as the successor to the linear equations and linear least-squares routines of LINPACK and the eigenvalue routines of EISPACK.
LAPACK is a set of Fortran subroutines covering a wide area of linear algebra algorithms. It was developed with the intention of being portable across a range of parallel processing environments.
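To make that concrete, here is a minimal sketch of calling LAPACK from C to solve a small linear system Ax = b. It assumes a LAPACK build that ships the LAPACKE C interface (e.g. linked with -llapacke -llapack -lblas); the 3x3 data is made up purely for illustration:

#include <stdio.h>
#include <lapacke.h>

int main(void)
{
    /* Row-major 3x3 system A * x = b; dgesv overwrites b with x. */
    double A[3 * 3] = { 2.0, 1.0, 1.0,
                        1.0, 3.0, 2.0,
                        1.0, 0.0, 0.0 };
    double b[3] = { 4.0, 5.0, 6.0 };
    lapack_int ipiv[3];  /* pivot indices from the LU factorization */

    lapack_int info = LAPACKE_dgesv(LAPACK_ROW_MAJOR, 3, 1, A, 3, ipiv, b, 1);
    if (info != 0) {
        fprintf(stderr, "dgesv failed, info = %d\n", (int)info);
        return 1;
    }
    printf("x = (%g, %g, %g)\n", b[0], b[1], b[2]);
    return 0;
}

Note that a single dgesv call bundles the LU factorization and the forward/back substitution; the BLAS underneath does the heavy lifting.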
Vendor-provided LAPACK/BLAS libraries (Intel's IPP/MKL have been mentioned, but there's also AMD's ACML, and other CPU vendors like IBM/Power or Oracle/SPARC provide equivalents as well) are often highly optimized for specific CPU capabilities, which can significantly boost performance on large data sets.
Often, though, you've got very specific small data to operate on (say, 4x4 matrices or 4D dot products, i.e. operations used in 3D geometry processing), and for that sort of thing BLAS/LAPACK are overkill, because these subroutines first run tests to decide which code path to take, depending on properties of the data set. In those situations, simple C/C++ source code, maybe using SSE2 to SSE4 intrinsics and/or compiler-generated vectorization, may beat BLAS/LAPACK.
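For example, a 4D single-precision dot product fits in a handful of SSE instructions with none of that dispatch overhead. A minimal sketch using intrinsics (the hypothetical dot4 helper is just for illustration; _mm_hadd_ps requires SSE3, and the inputs are assumed 16-byte aligned):

#include <pmmintrin.h>  /* SSE3: _mm_hadd_ps */

/* Dot product of two 16-byte-aligned 4-float vectors. */
static float dot4(const float *a, const float *b)
{
    __m128 va   = _mm_load_ps(a);       /* load 4 floats from each input */
    __m128 vb   = _mm_load_ps(b);
    __m128 prod = _mm_mul_ps(va, vb);   /* elementwise multiply */
    prod = _mm_hadd_ps(prod, prod);     /* two horizontal adds sum all lanes */
    prod = _mm_hadd_ps(prod, prod);
    return _mm_cvtss_f32(prod);         /* extract the low lane as a scalar */
}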
That's why, for example, Intel has two libraries: MKL for large linear algebra data sets, and IPP for small (graphics vector) data sets.
Also, regarding "simple for loops": give the compiler the chance to vectorize for you. For instance, something like:
/* Assumes DIM_OF_MY_VECTOR is a multiple of 4. */
double dotprod = 0.0;
size_t i;

/* Elementwise products, manually unrolled by four. */
for (i = 0; i < DIM_OF_MY_VECTOR; i += 4) {
    vecmul[i]   = src1[i]   * src2[i];
    vecmul[i+1] = src1[i+1] * src2[i+1];
    vecmul[i+2] = src1[i+2] * src2[i+2];
    vecmul[i+3] = src1[i+3] * src2[i+3];
}

/* Sum the partial products, again four at a time. */
for (i = 0; i < DIM_OF_MY_VECTOR; i += 4)
    dotprod += vecmul[i] + vecmul[i+1] + vecmul[i+2] + vecmul[i+3];
might be a better feed to a vectorizing compiler than the plain

for (i = 0; i < DIM_OF_MY_VECTOR; i++) dotprod += src1[i] * src2[i];

expression. So a lot depends on what exactly you mean by "calculations with for loops".
If your vector dimensions are large enough, though, the BLAS version,

dotprod = cblas_ddot(DIM_OF_MY_VECTOR, src1, 1, src2, 1);
will be cleaner code and likely faster.
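For completeness, here is a minimal sketch of that call in a full program, assuming a CBLAS implementation such as OpenBLAS or the Netlib reference is installed (e.g. linked with -lcblas or -lopenblas); the vector contents are made up for illustration:

#include <stdio.h>
#include <cblas.h>

#define DIM_OF_MY_VECTOR 8

int main(void)
{
    double src1[DIM_OF_MY_VECTOR] = { 1, 2, 3, 4, 5, 6, 7, 8 };
    double src2[DIM_OF_MY_VECTOR] = { 8, 7, 6, 5, 4, 3, 2, 1 };

    /* ddot: double-precision dot product, both vectors with stride 1. */
    double dotprod = cblas_ddot(DIM_OF_MY_VECTOR, src1, 1, src2, 1);

    printf("dotprod = %g\n", dotprod);
    return 0;
}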