Could a CUDA kernel call a cublas function?

I know it sounds weird, but here is my scenario:

I need to do a matrix-matrix multiplication, A (n*k) times B (k*n), but I only need the diagonal elements of the output matrix. I searched the cublas library and didn't find any level 2 or 3 function that can do that. So I decided to distribute the rows of A and the columns of B across CUDA threads. Each thread (idx) needs to calculate the dot product "A[idx,:]*B[:,idx]" and save it as the corresponding diagonal element of the output. Since this dot product also takes some time, I wonder whether I could call a cublas function (say cublasSdot) inside the kernel to compute it.

If I missed some cublas function that can achieve my goal directly (calculating only the diagonal elements of a matrix-matrix product), this question can be discarded.

asked Nov 14 '12 00:11

Hailiang Zhang

2 Answers

Yes, it can, but only in CUDA versions before CUDA 10.

"The language interface and Device Runtime API available in CUDA C/C++ is a subset of the CUDA Runtime API available on the Host. The syntax and semantics of the CUDA Runtime API have been retained on the device in order to facilitate ease of code reuse for API routines that may run in either the host or device environments. A kernel can also call GPU libraries such as CUBLAS directly without needing to return to the CPU." Source

Here you can see a matrix-vector multiplication that uses CUDA and the CUBLAS library function cublasSgemv.

Bear in mind, however, that there is no longer a device CUBLAS capability as of CUDA 10. To cite Robert_Crovella:

The current recommendation would be to see if CUTLASS 2 will help (it is mostly focused on GEMM related activities). If not, write your own code to perform the function, or call cublas from host code.

Nonetheless, there are currently several implementations of matrix-vector multiplication available online, for instance 1, 2, among others.
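For the specific case in the question (only the diagonal of A*B), the "write your own code" route can be quite short. Below is a minimal sketch, not a tuned implementation. It assumes single precision and that both A (n x k) and B (k x n) are stored row-major; the names diag_of_product_kernel, diag_of_product and CUDA_CHECK are made up for illustration and are not part of any library.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>
#include <vector>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                          \
    do {                                                          \
        cudaError_t err_ = (call);                                \
        if (err_ != cudaSuccess) {                                \
            std::fprintf(stderr, "CUDA error: %s\n",              \
                         cudaGetErrorString(err_));               \
            std::exit(EXIT_FAILURE);                              \
        }                                                         \
    } while (0)

// One thread per diagonal element: d[i] = sum_j A[i][j] * B[j][i].
// Assumed layout: A is n x k and B is k x n, both row-major.
__global__ void diag_of_product_kernel(const float* A, const float* B,
                                       float* d, int n, int k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;  // guard the last, partially filled block

    float sum = 0.0f;
    for (int j = 0; j < k; ++j) {
        // Row i of A times column i of B.
        sum += A[(size_t)i * k + j] * B[(size_t)j * n + i];
    }
    d[i] = sum;
}

// Hypothetical host wrapper: copies the inputs to the device, launches the
// kernel, and returns the n diagonal elements of A*B.
std::vector<float> diag_of_product(const std::vector<float>& A,
                                   const std::vector<float>& B,
                                   int n, int k)
{
    size_t bytesAB = (size_t)n * k * sizeof(float);  // A and B have n*k elements each
    size_t bytesD  = (size_t)n * sizeof(float);

    float *dA = nullptr, *dB = nullptr, *dD = nullptr;
    CUDA_CHECK(cudaMalloc(&dA, bytesAB));
    CUDA_CHECK(cudaMalloc(&dB, bytesAB));
    CUDA_CHECK(cudaMalloc(&dD, bytesD));
    CUDA_CHECK(cudaMemcpy(dA, A.data(), bytesAB, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(dB, B.data(), bytesAB, cudaMemcpyHostToDevice));

    int threads = 256;
    int blocks  = (n + threads - 1) / threads;  // round up to cover all n rows
    diag_of_product_kernel<<<blocks, threads>>>(dA, dB, dD, n, k);
    CUDA_CHECK(cudaGetLastError());

    std::vector<float> d(n);
    // This blocking copy also waits for the kernel to finish.
    CUDA_CHECK(cudaMemcpy(d.data(), dD, bytesD, cudaMemcpyDeviceToHost));

    CUDA_CHECK(cudaFree(dA));
    CUDA_CHECK(cudaFree(dB));
    CUDA_CHECK(cudaFree(dD));
    return d;
}
```

With this layout, neighbouring threads read neighbouring elements of B, but their reads of A are k elements apart. If k is large, storing A transposed or assigning one warp or block per row with a reduction would likely be faster; the sketch keeps the one-thread-per-diagonal-element mapping described in the question.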

answered Nov 30 '22 06:11

dreamcrash

Make sure you are using the device library when you call cublas from a kernel. You can't use the same library that you use to call it from the host. Details about the cuda device library can be found in the CUDA Toolkit documentation: http://docs.nvidia.com/cuda/cublas/index.html#device-api

Look at the CUDA 5 samples under 7_CUDALibraries/.

answered Nov 30 '22 06:11

Sameer Asal

Tags: parallel-processing, cuda, gpu, cublas
