
Large matrix multiplication on GPU

I need to implement matrix multiplication on a GPU with CUDA for large matrices. Each matrix alone is larger than the GPU memory, so I think I need an algorithm to do this efficiently. I searched around the internet but couldn't find one. Can anyone give me the name of, or a link to, such an algorithm?

asked Jan 28 '13 by Soroosh Khoram


People also ask

Why is matrix multiplication faster on GPU?

Matrix multiplication is easy to parallelize: every element of the result can be computed independently. A GPU can run many thousands of threads, organized into blocks, so these independent multiply-accumulate computations execute concurrently, resulting in much faster computation.
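
A minimal CUDA kernel sketch of that idea (illustrative only, not from this page): each thread computes one element of C = A * B, and the threads are grouped into 2-D blocks that tile the output matrix. Matrices are assumed to be stored row-major.

    __global__ void matmul_naive(const float *A, const float *B, float *C,
                                 int M, int N, int K)
    {
        // One thread per output element C[row][col].
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;

        if (row < M && col < N) {
            float sum = 0.0f;
            for (int k = 0; k < K; ++k)                 // dot product of a row of A
                sum += A[row * K + k] * B[k * N + col]; // with a column of B
            C[row * N + col] = sum;
        }
    }

    // Example launch: one 16x16 block of threads per 16x16 tile of C.
    // dim3 block(16, 16);
    // dim3 grid((N + 15) / 16, (M + 15) / 16);
    // matmul_naive<<<grid, block>>>(dA, dB, dC, M, N, K);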

How do you do large matrix multiplication?

Multiplying Larger Matrices: For the entry in the ith row and the jth column of the product matrix, multiply each entry in the ith row of the first matrix by the corresponding entry in the jth column of the second matrix and add the results.
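
As a concrete illustration, here is that rule written out as a plain triple loop (my own sketch, row-major storage): entry C[i][j] accumulates the products of row i of A with column j of B.

    void matmul_cpu(const float *A, const float *B, float *C,
                    int M, int N, int K)
    {
        for (int i = 0; i < M; ++i)
            for (int j = 0; j < N; ++j) {
                float sum = 0.0f;
                for (int k = 0; k < K; ++k)
                    sum += A[i * K + k] * B[k * N + j];  // a_ik * b_kj
                C[i * N + j] = sum;                      // entry in row i, column j
            }
    }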

Are GPUs good at linear algebra?

Basic numerical linear algebra operations play a crucial role in real-time 3D computer graphics, so GPUs are designed around this set of operations. Because GPUs offer higher peak performance and memory bandwidth than multi-core CPUs, numerical linear algebra applications can run much faster on them.


1 Answer

There isn't really a formal algorithm for this; in general, these sorts of linear algebra operations where the whole problem isn't stored in memory simultaneously are referred to as "out of core" operations.

To solve it, you don't need a particularly elaborate algorithm, just the CUBLAS library and a pencil and paper. For example, you can decompose the matrix product like this:

(splitting A into two row blocks and B into two column blocks)

    A = [ A1 ]        B = [ B1  B2 ]
        [ A2 ]

    A·B = [ A1·B1   A1·B2 ]
          [ A2·B1   A2·B2 ]

which gives you four independent sub-matrix multiplication operations. These can be calculated with four calls to CUBLAS gemm, driven by very straightforward host code. You can extend the idea to as many sub-matrices as are required to match the problem size and your GPU capacity. The same principle can also be used to implement matrix multiplication problems on multiple GPUs (see this question for an example).
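
Below is a minimal host-code sketch of that idea (my own illustration, not the answer's code), assuming single-precision data, column-major storage as cuBLAS expects, and a 2x2 block split in which each sub-product fits in GPU memory; error checking is omitted for brevity.

    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    // Multiply host matrices hA (M x K) and hB (K x N) into hC (M x N),
    // one (M/2 x K) * (K x N/2) sub-product at a time. M and N assumed even.
    void blocked_gemm(const float *hA, const float *hB, float *hC,
                      int M, int N, int K)
    {
        const int Mb = M / 2;                 // rows per block of A and C
        const int Nb = N / 2;                 // columns per block of B and C
        const float alpha = 1.0f, beta = 0.0f;

        cublasHandle_t handle;
        cublasCreate(&handle);

        // Device buffers sized for a single sub-problem only.
        float *dA, *dB, *dC;
        cudaMalloc(&dA, sizeof(float) * Mb * K);
        cudaMalloc(&dB, sizeof(float) * K  * Nb);
        cudaMalloc(&dC, sizeof(float) * Mb * Nb);

        for (int bi = 0; bi < 2; ++bi) {      // block row of A
            for (int bj = 0; bj < 2; ++bj) {  // block column of B
                // Copy one row block of A and one column block of B to the
                // device; cublasSetMatrix handles the leading dimensions.
                cublasSetMatrix(Mb, K, sizeof(float), hA + bi * Mb, M, dA, Mb);
                cublasSetMatrix(K, Nb, sizeof(float),
                                hB + (size_t)bj * Nb * K, K, dB, K);

                // dC = dA * dB : one of the four independent sub-products.
                cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, Mb, Nb, K,
                            &alpha, dA, Mb, dB, K, &beta, dC, Mb);

                // Copy the finished (Mb x Nb) block back into its place in hC.
                cublasGetMatrix(Mb, Nb, sizeof(float), dC, Mb,
                                hC + (size_t)bj * Nb * M + bi * Mb, M);
            }
        }

        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        cublasDestroy(handle);
    }

The same loop structure extends to any number of blocks, and if the transfers become the bottleneck the copies can be overlapped with the gemm calls using CUDA streams.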

Alternatively, you can find a working implementation of this precise idea in the Harvard-developed SciGPU-GEMM codebase and in the HPL-CUDA linpack implementation (disclaimer: I am affiliated with the latter codebase).

answered Oct 01 '22 by talonmies