 

BLAS seems very slow

I am comparing matrix multiplications on my machine, and it seems like the C++ BLAS (through GSL) is very slow. Multiplying two 1000x1000 matrices takes about 4 seconds in C++, while the same multiplication in Python takes about 1.5 seconds. I think there might be something wrong with the linking, but I don't know much about how to fix this kind of thing. Here is the C++ code:

    #include <stdio.h>
    #include <iostream>
    #include <time.h>
    #include <gsl/gsl_blas.h>
    using namespace std;

    // Difference between two clock() readings, converted to milliseconds.
    double diffclock(clock_t clock1, clock_t clock2)
    {
        double diffticks = clock1 - clock2;
        double diffms = (diffticks * 1000) / CLOCKS_PER_SEC;
        return diffms;
    }

    int main(void)
    {
        double* a = new double[1000 * 1000];
        double* b = new double[1000 * 1000];
        double* c = new double[1000 * 1000];

        // Fill A and B with some arbitrary values
        // (the cast avoids int overflow in i * i).
        for (int i = 0; i < 1000 * 1000; i++) {
            a[i] = i;
            b[i] = i / 5 + static_cast<double>(i) * i / 100;
        }

        gsl_matrix_view A = gsl_matrix_view_array(a, 1000, 1000);
        gsl_matrix_view B = gsl_matrix_view_array(b, 1000, 1000);
        gsl_matrix_view C = gsl_matrix_view_array(c, 1000, 1000);

        /* Compute C = A B and time the call */
        cout << "start" << endl;
        clock_t begin = clock();

        gsl_blas_dgemm(CblasNoTrans, CblasNoTrans,
                       1.0, &A.matrix, &B.matrix,
                       0.0, &C.matrix);

        clock_t end = clock();
        cout << diffclock(end, begin) << endl;  // elapsed time in ms

        delete[] a;
        delete[] b;
        delete[] c;
        return 0;
    }

I am compiling with

    g++ -o program mm.cpp -I/home/gsl/include -lm -L/home/gsl/lib -lgsl -lgslcblas

The Python code is:

    import time
    import numpy as np

    n = 1000
    a = np.zeros((n, n))
    b = np.zeros((n, n))

    # Fill a and b with some arbitrary values
    for i in range(0, n):
        for j in range(0, n):
            a[i, j] = i * n + j
            b[i, j] = (i * n + j) / 5 + (n * i + j) ** 2 / 5

    print("start")
    start = time.time()
    c = np.dot(a, b)
    end = time.time()
    print(end - start)

Thanks for any help!

asked Oct 20 '22 by Jonathan Lindgren

1 Answer

The subroutines in BLAS are a de facto standard, and a whole slew of optimized and vendor-specific libraries exist that implement the interface. Both numpy and the GSL can be linked against a variety of different BLASes (or in some circumstances use their own implementations), but from this perspective both numpy and the GSL are pretty much wrappers: the performance you get depends almost entirely on the BLAS they're linked against.
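
To make the wrapper point concrete: gsl_blas_dgemm is a thin layer over the standard CBLAS interface, so you can also call cblas_dgemm yourself and decide at link time which library actually does the work. A minimal sketch, assuming a cblas.h header is on your include path (OpenBLAS ships one; GSL's reference implementation declares the same prototypes in gsl/gsl_cblas.h):

    #include <iostream>
    #include <vector>
    #include <cblas.h>  // shipped by OpenBLAS; <gsl/gsl_cblas.h> declares the same interface

    int main()
    {
        const int n = 1000;
        std::vector<double> a(n * n), b(n * n), c(n * n);

        for (int i = 0; i < n * n; i++) {
            a[i] = i;
            b[i] = i / 5 + static_cast<double>(i) * i / 100;
        }

        // C = 1.0 * A * B + 0.0 * C, row-major n x n matrices.
        // Which BLAS executes this call is decided purely by the -l flags at link time.
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n,
                    1.0, a.data(), n,
                    b.data(), n,
                    0.0, c.data(), n);

        std::cout << c[0] << std::endl;  // use the result so the call is not optimized away
        return 0;
    }

Linking this with -lgslcblas gives the unoptimized reference CBLAS, while -lopenblas (or MKL's libraries) gives an optimized one; the source does not change.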

With the GSL, it's relatively easy to link against an alternative BLAS. There are some instructions here: http://www.gnu.org/software/gsl/manual/html_node/Linking-with-an-alternative-BLAS-library.html

Intel's MKL is one BLAS that is generally quite fast (at least if you don't have an AMD CPU) but notoriously difficult to link against. They even have a web application to help you write the link line: http://software.intel.com/en-us/articles/intel-mkl-link-line-advisor. I have had pretty good luck with OpenBLAS (http://www.openblas.net/), getting performance that's within 1 or 2 percent of MKL on an i7-3770K CPU. OpenBLAS is also pretty easy to compile; it's much less of a headache than ATLAS.

Once you have OpenBLAS, either by compiling it from source or installing it from your package manager if you're on a *nix, your modified compile line would basically be:

    g++ -o program mm.cpp -I/home/gsl/include -lm -L/home/gsl/lib -lgsl -lcblas -lopenblas
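
As an aside on the timing itself: clock() measures CPU time, which is summed over all threads, so with a multithreaded BLAS such as OpenBLAS it can report more than the elapsed time, whereas Python's time.time() is wall-clock time. A minimal sketch of a wall-clock timer using std::chrono (C++11) that could stand in for the clock() calls in the question:

    #include <chrono>
    #include <iostream>

    int main()
    {
        auto begin = std::chrono::steady_clock::now();

        // ... gsl_blas_dgemm / cblas_dgemm call goes here ...

        auto end = std::chrono::steady_clock::now();
        double ms = std::chrono::duration<double, std::milli>(end - begin).count();
        std::cout << ms << " ms" << std::endl;  // elapsed wall-clock time
        return 0;
    }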

answered Oct 23 '22 by Robert T. McGibbon