Numba matrix-vector multiplication

I'm trying to use NumbaPro to write the simple matrix-vector multiplication below:

from numbapro import cuda
from numba import *
import numpy as np
import math
from timeit import default_timer as time

n = 100

@cuda.jit('void(float32[:,:], float32[:], float32[:])')
def cu_matrix_vector(A, b, c):
    y, x = cuda.grid(2)
    if y < n:
        c[y] = 0.0

    if x < n and y < n:
        for i in range(n):
            c[y] += A[y, i] * b[i]


A = np.array(np.random.random((n, n)), dtype=np.float32)
B = np.array(np.random.random((n, 1)), dtype=np.float32)
C = np.empty_like(B)

s = time()
dA = cuda.to_device(A)
dB = cuda.to_device(B)
dC = cuda.to_device(C)
cu_matrix_vector(dA, dB, dC)
dC.to_host()

e = time()
tcuda = e - s

but I'm getting the following error:

numbapro.cudadrv.error.CudaDriverError: CUDA_ERROR_LAUNCH_FAILED Failed to copy memory D->H

I don't understand why the device-to-host copy is failing. Please help.

asked Jan 29 '14 by kirikoumath

1 Answer

Your code has multiple problems:

  1. The B and C vectors are Nx1 2D matrices, not 1D vectors, but your kernel's type signature declares them as "float32[:]" -- 1D vectors. The kernel also indexes them with a single index, which causes runtime errors on the GPU due to misaligned access (cuda-memcheck is your friend here!).
  2. Your kernel assumes a 2D grid but uses only one column of it, so many threads perform the same computation and overwrite each other's results.
  3. No execution configuration is given, so NumbaPro launches the kernel with 1 block of 1 thread (nvprof is your friend here!); see the launch sketch just below.
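
For reference, Numba/NumbaPro kernels take the execution configuration in square brackets before the argument list. A minimal sketch (the choice of 256 threads per block is illustrative, not from the original post):

threads_per_block = 256
blocks_per_grid = (m + threads_per_block - 1) // threads_per_block  # ceiling division
cu_matrix_vector[blocks_per_grid, threads_per_block](dA, dB, dC)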

Here is code that works. Note that it uses a 1D grid of 1D blocks and loops over the columns of the matrix, so it is optimized for the case where the matrix has many rows. A kernel optimized for a short, wide matrix would need a different approach, namely a parallel reduction per row; a sketch of that idea follows the code below. In practice, though, I would just use CUBLAS sgemv (which is also exposed in NumbaPro) instead.

from numbapro import cuda
from numba import *
import numpy as np
import math
from timeit import default_timer as time

m = 100000
n = 100

@cuda.jit('void(f4[:,:], f4[:], f4[:])')
def cu_matrix_vector(A, b, c):
    row = cuda.grid(1)  # global thread index maps to one row of A
    if row < m:
        acc = 0.0
        for i in range(n):
            acc += A[row, i] * b[i]
        c[row] = acc

A = np.array(np.random.random((m, n)), dtype=np.float32)
B = np.array(np.random.random(m), dtype=np.float32)  # a true 1D vector
C = np.empty_like(B)

s = time()
dA = cuda.to_device(A)
dB = cuda.to_device(B)
dC = cuda.to_device(C)

# 512 threads per block, and enough blocks to cover all m rows
cu_matrix_vector[(m + 511) // 512, 512](dA, dB, dC)

dC.to_host()

print(C)

e = time()
tcuda = e - s
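
For the short-and-wide case mentioned above, a parallel reduction assigns one thread block per row and has the block's threads cooperate on that row's dot product. This is only a minimal sketch of the idea, written against the plain numba.cuda API (NumbaPro exposed the same primitives); TPB and the kernel name are illustrative choices, not from the original answer:

from numba import cuda, float32

TPB = 128  # threads per block; one block computes one element of c

@cuda.jit
def row_dot_reduce(A, b, c):
    sm = cuda.shared.array(TPB, float32)
    row = cuda.blockIdx.x
    tid = cuda.threadIdx.x

    # Each thread accumulates a strided partial dot product over the row.
    acc = 0.0
    for j in range(tid, A.shape[1], TPB):
        acc += A[row, j] * b[j]
    sm[tid] = acc
    cuda.syncthreads()

    # Tree reduction in shared memory.
    stride = TPB // 2
    while stride > 0:
        if tid < stride:
            sm[tid] += sm[tid + stride]
        cuda.syncthreads()
        stride //= 2

    if tid == 0:
        c[row] = sm[0]

Launch it with one block per row, e.g. row_dot_reduce[m, TPB](dA, dB, dC).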
answered by harrism