TL;DR for those who want to skip the whole story: is there a way to interface RcppArmadillo with NVBLAS to make use of the GPU, much like you would interface plain Armadillo with NVBLAS in pure C++ code without R?
I'm trying to make use of the NVBLAS library (http://docs.nvidia.com/cuda/nvblas/) to speed up the linear algebra part in my projects (Computational Statistics mostly, MCMC, particle filters and all those goodies) by diverting some computations to the GPUs.
I mostly use C++ code, and in particular the Armadillo library for matrix computations; from their FAQ I learned that I can use NVBLAS simply by linking Armadillo the right way (http://arma.sourceforge.net/faq.html).
So I set up my installation of the libraries and wrote the following dummy program:
#include <armadillo>

int main(){
    arma::mat A = arma::randn<arma::mat>(3000, 2000);
    arma::mat B = cov(A);             // 2000 x 2000 covariance matrix
    arma::vec V = arma::randn(2000);
    arma::mat C;
    arma::mat D;
    for(int i = 0; i < 20; ++i){
        C = solve(B, V);              // solve B*C = V (coefficient matrix first)
        D = inv(B);
    }
    return 0;
}
compiled it with
g++ arma_try.cpp -o arma_try.so -larmadillo
and profiled it with
nvprof ./arma_try.so
The profiler output shows:
==11798== Profiling application: ./arma_try.so
==11798== Profiling result:
Time(%) Time Calls Avg Min Max Name
72.15% 4.41253s 580 7.6078ms 1.0360ms 14.673ms void magma_lds128_dgemm_kernel<bool=0, bool=0, int=5, int=5, int=3, int=3, int=3>(int, int, int, double const *, int, double const *, int, double*, int, int, int, double const *, double const *, double, double, int)
20.75% 1.26902s 1983 639.95us 1.3440us 2.9929ms [CUDA memcpy HtoD]
4.06% 248.17ms 1 248.17ms 248.17ms 248.17ms void fermiDsyrk_v2_kernel_core<bool=1, bool=1, bool=0, bool=1>(double*, int, int, int, int, int, int, double const *, double const *, double, double, int)
1.81% 110.54ms 1 110.54ms 110.54ms 110.54ms void fermiDsyrk_v2_kernel_core<bool=0, bool=1, bool=0, bool=1>(double*, int, int, int, int, int, int, double const *, double const *, double, double, int)
1.05% 64.023ms 581 110.19us 82.913us 12.211ms [CUDA memcpy DtoH]
0.11% 6.9438ms 1 6.9438ms 6.9438ms 6.9438ms void gemm_kernel2x2_tile_multiple_core<double, bool=1, bool=0, bool=0, bool=1, bool=0>(double*, double const *, double const *, int, int, int, int, int, int, double*, double*, double, double, int)
0.06% 3.3712ms 1 3.3712ms 3.3712ms 3.3712ms void gemm_kernel2x2_core<double, bool=0, bool=0, bool=0, bool=1, bool=0>(double*, double const *, double const *, int, int, int, int, int, int, double*, double*, double, double, int)
0.02% 1.3192ms 1 1.3192ms 1.3192ms 1.3192ms void syherk_kernel_core<double, double, int=256, int=4, bool=1, bool=0, bool=0, bool=1, bool=0, bool=1>(cublasSyherkParams<double, double>)
0.00% 236.03us 1 236.03us 236.03us 236.03us void syherk_kernel_core<double, double, int=256, int=4, bool=0, bool=0, bool=0, bool=1, bool=0, bool=1>(cublasSyherkParams<double, double>)
where I recognise dgemm and others ... so it's working! Wonderful.
Now I'd like to run the same code interfaced with R, as I sometimes need it for input/output and plots. RcppArmadillo has always worked wonders for me, providing, alongside Rcpp, all the tools I needed. I thus wrote the cpp file:
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]

// [[Rcpp::export]]
int arma_call(){
    arma::mat A = arma::randn<arma::mat>(3000, 2000);
    arma::mat B = cov(A);
    arma::vec V = arma::randn(2000);
    arma::mat C;
    arma::mat D;
    for(int i = 0; i < 20; ++i){
        C = solve(B, V);   // solve B*C = V, as in the standalone version
        D = inv(B);
    }
    return 0;
}
and the R script:
Rcpp::sourceCpp('arma_try_R.cpp')
arma_call()
and tried to execute it by running, in the console:
nvprof R CMD BATCH arma_try_R.R
(edit: note that using Rscript rather than R CMD BATCH produces the same results) BUT:
[NVBLAS] Cannot open default config file 'nvblas.conf'
Weird... maybe R cannot reach the file for whatever reason, so I copied it into the working directory and re-ran the code:
==12662== NVPROF is profiling process 12662, command: /bin/sh /usr/bin/R CMD BATCH arma_try_R.R
==12662== Profiling application: /bin/sh /usr/bin/R CMD BATCH arma_try_R.R
==12662== Profiling result: No kernels were profiled.
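For reference, the nvblas.conf I copied next to the script is a minimal one; it can be recreated like this (the keys are the standard ones from the NVBLAS documentation, but the CPU BLAS path is specific to my system and must point at a real CPU BLAS on yours):

```shell
# Recreate a minimal nvblas.conf in the working directory.
# NVBLAS_CPU_BLAS_LIB is mandatory: it is the CPU BLAS that NVBLAS
# falls back to for calls it does not offload to the GPU.
cat > nvblas.conf <<'EOF'
NVBLAS_LOGFILE nvblas.log
NVBLAS_CPU_BLAS_LIB /usr/lib/libblas.so
NVBLAS_GPU_LIST ALL
EOF
```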
I have no idea what's causing this. I am on a Linux system with Bumblebee installed, though, so as a last resort I tried:
nvprof optirun R CMD BATCH arma_try_R.R
to sort of force R to run on the Nvidia card, and this time the output is
==10900== Profiling application: optirun R CMD BATCH arma_try_R.R
==10900== Profiling result:
Time(%) Time Calls Avg Min Max Name
100.00% 1.3760us 1 1.3760us 1.3760us 1.3760us [CUDA memcpy HtoD]
So: no calls to the CUDA libraries at all, and no computation delegated to the GPU, as far as I can tell from the profiler. Now there are actually many questions, not just one.
Is this because of the way the code is compiled inside R? Verbose mode shows:
/usr/lib64/R/bin/R CMD SHLIB -o 'sourceCpp_27457.so' --preclean 'arma_try_R.cpp'
g++ -I/usr/include/R/ -DNDEBUG -D_FORTIFY_SOURCE=2 -I"/home/marco/R/x86_64-unknown-linux-gnu-library/3.2/Rcpp/include" -I"/home/marco/R/x86_64-unknown-linux-gnu-library/3.2/RcppArmadillo/include" -I"/home/marco/prova_cuda" -fpic -march=x86-64 -mtune=generic -O2 -pipe -fstack-protector-strong --param=ssp-buffer-size=4 -c arma_try_R.cpp -o arma_try_R.o
g++ -shared -L/usr/lib64/R/lib -Wl,-O1,--sort-common,--as-needed,-z,relro -lblas -llapack -o sourceCpp_27457.so arma_try_R.o -llapack -lblas -lgfortran -lm -lquadmath -L/usr/lib64/R/lib -lR
and even if I force -larmadillo instead of -lblas (via the PKG_LIBS environment variable), nothing changes.
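For completeness, this is the sort of override I tried, as a sketch: R CMD SHLIB (and hence sourceCpp()) picks up PKG_LIBS from the environment when linking the generated shared object.

```shell
# Make R CMD SHLIB link the generated shared object against Armadillo
# (which on my machine is itself linked against NVBLAS) instead of the
# default plain -lblas / -llapack pair.
export PKG_LIBS="-larmadillo"
# then run the script as before, e.g.:
#   Rscript arma_try_R.R
```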
If you need any more output I can provide it; thanks for reading this far anyway!
EDIT:
ldd /usr/lib/R/lib/libR.so
[NVBLAS] Using devices :0
linux-vdso.so.1 (0x00007ffdb5bd6000)
/opt/cuda/lib64/libnvblas.so (0x00007f4afaccd000)
libblas.so => /usr/lib/libblas.so (0x00007f4afa6ea000)
libm.so.6 => /usr/lib/libm.so.6 (0x00007f4afa3ec000)
libreadline.so.6 => /usr/lib/libreadline.so.6 (0x00007f4afa1a1000)
libpcre.so.1 => /usr/lib/libpcre.so.1 (0x00007f4af9f31000)
liblzma.so.5 => /usr/lib/liblzma.so.5 (0x00007f4af9d0b000)
libbz2.so.1.0 => /usr/lib/libbz2.so.1.0 (0x00007f4af9afa000)
libz.so.1 => /usr/lib/libz.so.1 (0x00007f4af98e4000)
librt.so.1 => /usr/lib/librt.so.1 (0x00007f4af96dc000)
libdl.so.2 => /usr/lib/libdl.so.2 (0x00007f4af94d7000)
libgomp.so.1 => /usr/lib/libgomp.so.1 (0x00007f4af92b5000)
libpthread.so.0 => /usr/lib/libpthread.so.0 (0x00007f4af9098000)
libc.so.6 => /usr/lib/libc.so.6 (0x00007f4af8cf3000)
/usr/lib64/ld-linux-x86-64.so.2 (0x0000556509792000)
libcublas.so.7.5 => /opt/cuda/lib64/libcublas.so.7.5 (0x00007f4af7414000)
libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0x00007f4af7092000)
libgcc_s.so.1 => /usr/lib/libgcc_s.so.1 (0x00007f4af6e7b000)
libncursesw.so.6 => /usr/lib/libncursesw.so.6 (0x00007f4af6c0e000)
So, apart from the weird [NVBLAS] Using devices :0 line, it seems at least that R is aware of the CUDA NVBLAS library...
To answer my own question: YES, it's possible. It suffices to make R point to the right (NV)BLAS libraries, and RcppArmadillo will fetch the routines from the right place (you might want to read Dirk Eddelbuettel's comment on the question to see why).
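One concrete way to point R at NVBLAS, without relinking anything, is the preloading mechanism described in the NVBLAS documentation. This is a sketch: the .so path matches the ldd output above and the CUDA 7.5 install on my machine, so adjust it for yours.

```shell
# Intercept every BLAS call made by the R process with NVBLAS.
# NVBLAS_CONFIG_FILE is optional; it avoids depending on nvblas.conf
# sitting in the current working directory.
export NVBLAS_CONFIG_FILE="$PWD/nvblas.conf"
LD_PRELOAD=/opt/cuda/lib64/libnvblas.so Rscript arma_try_R.R
```

This route leaves R's default BLAS in place for everything NVBLAS chooses not to offload.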
Now onto the specifics of my problem and the reason for the self-answer:
I think the problem was not where I thought it was.
When running nvidia-smi in another terminal while Rscript arma_try_R.R is running, I get, for example:
+------------------------------------------------------+
| NVIDIA-SMI 352.41 Driver Version: 352.41 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 860M Off | 0000:01:00.0 Off | N/A |
| N/A 64C P0 N/A / N/A | 945MiB / 2047MiB | 21% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 20962 C /usr/lib64/R/bin/exec/R 46MiB |
| 0 21598 C nvidia-smi 45MiB |
+-----------------------------------------------------------------------------+
meaning that the GPU is indeed working!
The problem hence lies in nvprof, which cannot detect the GPU activity and sometimes freezes my Rscript. But that's a completely unrelated question.
(I'll wait to accept it as an answer to see if someone else comes and resolves it more cleverly...)