TL;DR for those who want to skip the whole story: is there a way to interface RcppArmadillo with NVBLAS to make use of the GPU, much like you would interface plain Armadillo with NVBLAS in pure C++ code without R?
I'm trying to make use of the NVBLAS library (http://docs.nvidia.com/cuda/nvblas/) to speed up the linear algebra part in my projects (Computational Statistics mostly, MCMC, particle filters and all those goodies) by diverting some computations to the GPUs.
I mostly use C++ code, and in particular the Armadillo library for matrix computations; from their FAQ I learned that I can use NVBLAS simply by linking Armadillo the right way (http://arma.sourceforge.net/faq.html).
So I set up my installation of the libraries and wrote the following dummy program:
#include <armadillo>

int main(){
    arma::mat A = arma::randn<arma::mat>(3000, 2000);
    arma::mat B = cov(A);             // 2000 x 2000 covariance matrix
    arma::vec V = arma::randn(2000);
    arma::mat C;
    arma::mat D;
    for(int i = 0; i < 20; ++i){
        C = solve(B, V);              // solve B*C = V (coefficient matrix first)
        D = inv(B);
    }
    return 0;
}
compiled it with
g++ arma_try.cpp -o arma_try.so -larmadillo
and profiled it with
nvprof ./arma_try.so
The profiler output shows:
==11798== Profiling application: ./arma_try.so
==11798== Profiling result:
Time(%) Time Calls Avg Min Max Name
72.15% 4.41253s 580 7.6078ms 1.0360ms 14.673ms void magma_lds128_dgemm_kernel<bool=0, bool=0, int=5, int=5, int=3, int=3, int=3>(int, int, int, double const *, int, double const *, int, double*, int, int, int, double const *, double const *, double, double, int)
20.75% 1.26902s 1983 639.95us 1.3440us 2.9929ms [CUDA memcpy HtoD]
4.06% 248.17ms 1 248.17ms 248.17ms 248.17ms void fermiDsyrk_v2_kernel_core<bool=1, bool=1, bool=0, bool=1>(double*, int, int, int, int, int, int, double const *, double const *, double, double, int)
1.81% 110.54ms 1 110.54ms 110.54ms 110.54ms void fermiDsyrk_v2_kernel_core<bool=0, bool=1, bool=0, bool=1>(double*, int, int, int, int, int, int, double const *, double const *, double, double, int)
1.05% 64.023ms 581 110.19us 82.913us 12.211ms [CUDA memcpy DtoH]
0.11% 6.9438ms 1 6.9438ms 6.9438ms 6.9438ms void gemm_kernel2x2_tile_multiple_core<double, bool=1, bool=0, bool=0, bool=1, bool=0>(double*, double const *, double const *, int, int, int, int, int, int, double*, double*, double, double, int)
0.06% 3.3712ms 1 3.3712ms 3.3712ms 3.3712ms void gemm_kernel2x2_core<double, bool=0, bool=0, bool=0, bool=1, bool=0>(double*, double const *, double const *, int, int, int, int, int, int, double*, double*, double, double, int)
0.02% 1.3192ms 1 1.3192ms 1.3192ms 1.3192ms void syherk_kernel_core<double, double, int=256, int=4, bool=1, bool=0, bool=0, bool=1, bool=0, bool=1>(cublasSyherkParams<double, double>)
0.00% 236.03us 1 236.03us 236.03us 236.03us void syherk_kernel_core<double, double, int=256, int=4, bool=0, bool=0, bool=0, bool=1, bool=0, bool=1>(cublasSyherkParams<double, double>)
where I recognise dgemm and others ... so it's working! Wonderful.
Now I'd like to run the same code interfaced with R, as I sometimes need it for input/output and plots. RcppArmadillo has always worked wonders for me, providing, alongside Rcpp, all the tools I needed. I thus wrote the cpp file:
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]

// [[Rcpp::export]]
int arma_call(){
    arma::mat A = arma::randn<arma::mat>(3000, 2000);
    arma::mat B = cov(A);
    arma::vec V = arma::randn(2000);
    arma::mat C;
    arma::mat D;
    for(int i = 0; i < 20; ++i){
        C = solve(B, V);   // solve B*C = V, as in the standalone version
        D = inv(B);
    }
    return 0;
}
and the R script:
Rcpp::sourceCpp('arma_try_R.cpp')
arma_call()
and tried to execute it by running, in the console:
nvprof R CMD BATCH arma_try_R.R
(edit: note that using Rscript rather than R CMD BATCH produces the same results) BUT:
[NVBLAS] Cannot open default config file 'nvblas.conf'
Weird... maybe R cannot reach the file for whatever reason, so I copied it into the working directory and re-ran the code:
==12662== NVPROF is profiling process 12662, command: /bin/sh /usr/bin/R CMD BATCH arma_try_R.R
==12662== Profiling application: /bin/sh /usr/bin/R CMD BATCH arma_try_R.R
==12662== Profiling result: No kernels were profiled.
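For reference, the nvblas.conf I copied next to the script is a minimal one; it can be recreated like this (the keys are the standard ones from the NVBLAS documentation, but the CPU BLAS path is specific to my system and must point at a real CPU BLAS on yours):

```shell
# Recreate a minimal nvblas.conf in the working directory.
# NVBLAS_CPU_BLAS_LIB is mandatory: it is the CPU BLAS that NVBLAS
# falls back to for calls it does not offload to the GPU.
cat > nvblas.conf <<'EOF'
NVBLAS_LOGFILE nvblas.log
NVBLAS_CPU_BLAS_LIB /usr/lib/libblas.so
NVBLAS_GPU_LIST ALL
EOF
```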
I have no idea what's causing this. I am on a Linux system with Bumblebee installed, though, so as a last resort I tried:
nvprof optirun R CMD BATCH arma_try_R.R
to sort of force R to run on the Nvidia card, and this time the output is
==10900== Profiling application: optirun R CMD BATCH arma_try_R.R
==10900== Profiling result:
Time(%) Time Calls Avg Min Max Name
100.00% 1.3760us 1 1.3760us 1.3760us 1.3760us [CUDA memcpy HtoD]
So: no calls to the CUDA libraries at all, and no computation delegated to the GPU, as far as I can tell from the profiler. Now there are actually many questions, not just one.
Is this because of the way the code is compiled inside R? Verbose mode shows:
/usr/lib64/R/bin/R CMD SHLIB -o 'sourceCpp_27457.so' --preclean 'arma_try_R.cpp'
g++ -I/usr/include/R/ -DNDEBUG -D_FORTIFY_SOURCE=2 -I"/home/marco/R/x86_64-unknown-linux-gnu-library/3.2/Rcpp/include" -I"/home/marco/R/x86_64-unknown-linux-gnu-library/3.2/RcppArmadillo/include" -I"/home/marco/prova_cuda" -fpic -march=x86-64 -mtune=generic -O2 -pipe -fstack-protector-strong --param=ssp-buffer-size=4 -c arma_try_R.cpp -o arma_try_R.o
g++ -shared -L/usr/lib64/R/lib -Wl,-O1,--sort-common,--as-needed,-z,relro -lblas -llapack -o sourceCpp_27457.so arma_try_R.o -llapack -lblas -lgfortran -lm -lquadmath -L/usr/lib64/R/lib -lR
and even if I force -larmadillo instead of -lblas (via the PKG_LIBS environment variable), nothing changes.
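For completeness, this is the sort of override I tried, as a sketch: R CMD SHLIB (and hence sourceCpp()) picks up PKG_LIBS from the environment when linking the generated shared object.

```shell
# Make R CMD SHLIB link the generated shared object against Armadillo
# (which on my machine is itself linked against NVBLAS) instead of the
# default plain -lblas / -llapack pair.
export PKG_LIBS="-larmadillo"
# then run the script as before, e.g.:
#   Rscript arma_try_R.R
```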
If you need any more output I can provide it; thanks for reading this far anyway!
EDIT:
ldd /usr/lib/R/lib/libR.so
[NVBLAS] Using devices :0
linux-vdso.so.1 (0x00007ffdb5bd6000)
/opt/cuda/lib64/libnvblas.so (0x00007f4afaccd000)
libblas.so => /usr/lib/libblas.so (0x00007f4afa6ea000)
libm.so.6 => /usr/lib/libm.so.6 (0x00007f4afa3ec000)
libreadline.so.6 => /usr/lib/libreadline.so.6 (0x00007f4afa1a1000)
libpcre.so.1 => /usr/lib/libpcre.so.1 (0x00007f4af9f31000)
liblzma.so.5 => /usr/lib/liblzma.so.5 (0x00007f4af9d0b000)
libbz2.so.1.0 => /usr/lib/libbz2.so.1.0 (0x00007f4af9afa000)
libz.so.1 => /usr/lib/libz.so.1 (0x00007f4af98e4000)
librt.so.1 => /usr/lib/librt.so.1 (0x00007f4af96dc000)
libdl.so.2 => /usr/lib/libdl.so.2 (0x00007f4af94d7000)
libgomp.so.1 => /usr/lib/libgomp.so.1 (0x00007f4af92b5000)
libpthread.so.0 => /usr/lib/libpthread.so.0 (0x00007f4af9098000)
libc.so.6 => /usr/lib/libc.so.6 (0x00007f4af8cf3000)
/usr/lib64/ld-linux-x86-64.so.2 (0x0000556509792000)
libcublas.so.7.5 => /opt/cuda/lib64/libcublas.so.7.5 (0x00007f4af7414000)
libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0x00007f4af7092000)
libgcc_s.so.1 => /usr/lib/libgcc_s.so.1 (0x00007f4af6e7b000)
libncursesw.so.6 => /usr/lib/libncursesw.so.6 (0x00007f4af6c0e000)
So, apart from the weird [NVBLAS] Using devices :0 line, it seems at least that R is aware of the CUDA NVBLAS library...
To answer my own question: YES, it's possible. It suffices to make R point to the right (NV)BLAS libraries, and RcppArmadillo will fetch the routines from the right place (you might want to read Dirk Eddelbuettel's comment on the question to see why).
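One concrete way to point R at NVBLAS, without relinking anything, is the preloading mechanism described in the NVBLAS documentation. This is a sketch: the .so path matches the ldd output above and the CUDA 7.5 install on my machine, so adjust it for yours.

```shell
# Intercept every BLAS call made by the R process with NVBLAS.
# NVBLAS_CONFIG_FILE is optional; it avoids depending on nvblas.conf
# sitting in the current working directory.
export NVBLAS_CONFIG_FILE="$PWD/nvblas.conf"
LD_PRELOAD=/opt/cuda/lib64/libnvblas.so Rscript arma_try_R.R
```

This route leaves R's default BLAS in place for everything NVBLAS chooses not to offload.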
Now onto the specifics of my problem and the reason for the self-answer:
I think the problem was not where I thought it was.
When running nvidia-smi in another terminal while Rscript arma_try_R.R is running, I get, for example:
+------------------------------------------------------+
| NVIDIA-SMI 352.41 Driver Version: 352.41 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 860M Off | 0000:01:00.0 Off | N/A |
| N/A 64C P0 N/A / N/A | 945MiB / 2047MiB | 21% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 20962 C /usr/lib64/R/bin/exec/R 46MiB |
| 0 21598 C nvidia-smi 45MiB |
+-----------------------------------------------------------------------------+
meaning that the GPU is indeed working!
The problem hence lies in nvprof, which cannot detect the GPU activity and sometimes freezes my Rscript. But that's a completely unrelated question.
(I'll wait to accept it as an answer to see if someone else comes and resolves it more cleverly...)