
Large matrix multiplication on GPU

I need to implement matrix multiplication on a GPU with CUDA for large matrices. Each matrix alone is larger than the GPU memory, so I think I need an algorithm to do this efficiently. I searched around the internet but couldn't find one. Can anyone give me the name of, or a link to, such an algorithm?

asked Jan 28 '13 by Soroosh Khoram


People also ask

Why is matrix multiplication faster on GPU?

Matrix multiplication is easy to parallelize: every element of the result can be computed independently. A GPU can run many thousands of threads, organized into blocks, so these independent multiply-accumulate computations execute concurrently, resulting in much faster computation.
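
A minimal CUDA kernel sketch of that idea (illustrative only, not from this page): each thread computes one element of C = A * B, and the threads are grouped into 2-D blocks that tile the output matrix. Matrices are assumed to be stored row-major.

    __global__ void matmul_naive(const float *A, const float *B, float *C,
                                 int M, int N, int K)
    {
        // One thread per output element C[row][col].
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;

        if (row < M && col < N) {
            float sum = 0.0f;
            for (int k = 0; k < K; ++k)                 // dot product of a row of A
                sum += A[row * K + k] * B[k * N + col]; // with a column of B
            C[row * N + col] = sum;
        }
    }

    // Example launch: one 16x16 block of threads per 16x16 tile of C.
    // dim3 block(16, 16);
    // dim3 grid((N + 15) / 16, (M + 15) / 16);
    // matmul_naive<<<grid, block>>>(dA, dB, dC, M, N, K);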

How do you do large matrix multiplication?

Multiplying Larger Matrices: For the entry in the ith row and the jth column of the product matrix, multiply each entry in the ith row of the first matrix by the corresponding entry in the jth column of the second matrix and add the results.
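
As a concrete illustration, here is that rule written out as a plain triple loop (my own sketch, row-major storage): entry C[i][j] accumulates the products of row i of A with column j of B.

    void matmul_cpu(const float *A, const float *B, float *C,
                    int M, int N, int K)
    {
        for (int i = 0; i < M; ++i)
            for (int j = 0; j < N; ++j) {
                float sum = 0.0f;
                for (int k = 0; k < K; ++k)
                    sum += A[i * K + k] * B[k * N + j];  // a_ik * b_kj
                C[i * N + j] = sum;                      // entry in row i, column j
            }
    }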

Are GPUs good at linear algebra?

Basic numerical linear algebra operations play a crucial role in real-time 3D computer graphics, so GPUs are designed around this set of operations. Because GPUs offer higher peak performance and memory bandwidth than multi-core CPUs, numerical linear algebra applications can run much faster on them.


1 Answer

There isn't really a formal algorithm for this; in general, these sorts of linear algebra operations where the whole problem isn't stored in memory simultaneously are referred to as "out of core" operations.

To solve it, you don't need a particularly elaborate algorithm, just the CUBLAS library and a pencil and paper. For example, you can decompose the matrix product like this:

(splitting A into two row blocks and B into two column blocks)

    A = [ A1 ]        B = [ B1  B2 ]
        [ A2 ]

    A·B = [ A1·B1   A1·B2 ]
          [ A2·B1   A2·B2 ]

which gives you four independent sub-matrix multiplication operations. These can be calculated with four calls to CUBLAS gemm, driven by very straightforward host code. You can extend the idea to as many sub-matrices as are required to match the problem size and your GPU capacity. The same principle can also be used to implement matrix multiplication problems on multiple GPUs (see this question for an example).
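
Below is a minimal host-code sketch of that idea (my own illustration, not the answer's code), assuming single-precision data, column-major storage as cuBLAS expects, and a 2x2 block split in which each sub-product fits in GPU memory; error checking is omitted for brevity.

    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    // Multiply host matrices hA (M x K) and hB (K x N) into hC (M x N),
    // one (M/2 x K) * (K x N/2) sub-product at a time. M and N assumed even.
    void blocked_gemm(const float *hA, const float *hB, float *hC,
                      int M, int N, int K)
    {
        const int Mb = M / 2;                 // rows per block of A and C
        const int Nb = N / 2;                 // columns per block of B and C
        const float alpha = 1.0f, beta = 0.0f;

        cublasHandle_t handle;
        cublasCreate(&handle);

        // Device buffers sized for a single sub-problem only.
        float *dA, *dB, *dC;
        cudaMalloc(&dA, sizeof(float) * Mb * K);
        cudaMalloc(&dB, sizeof(float) * K  * Nb);
        cudaMalloc(&dC, sizeof(float) * Mb * Nb);

        for (int bi = 0; bi < 2; ++bi) {      // block row of A
            for (int bj = 0; bj < 2; ++bj) {  // block column of B
                // Copy one row block of A and one column block of B to the
                // device; cublasSetMatrix handles the leading dimensions.
                cublasSetMatrix(Mb, K, sizeof(float), hA + bi * Mb, M, dA, Mb);
                cublasSetMatrix(K, Nb, sizeof(float),
                                hB + (size_t)bj * Nb * K, K, dB, K);

                // dC = dA * dB : one of the four independent sub-products.
                cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, Mb, Nb, K,
                            &alpha, dA, Mb, dB, K, &beta, dC, Mb);

                // Copy the finished (Mb x Nb) block back into its place in hC.
                cublasGetMatrix(Mb, Nb, sizeof(float), dC, Mb,
                                hC + (size_t)bj * Nb * M + bi * Mb, M);
            }
        }

        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        cublasDestroy(handle);
    }

The same loop structure extends to any number of blocks, and if the transfers become the bottleneck the copies can be overlapped with the gemm calls using CUDA streams.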

Alternatively, you can find a working implementation of this precise idea in the Harvard-developed SciGPU-GEMM codebase and in the HPL-CUDA linpack implementation (disclaimer: I am affiliated with the latter codebase).

answered Oct 01 '22 by talonmies