 

CUDA Python GPU NumbaPro 3D loop poor performance

I am trying to set up a 3D loop with the assignment

 C(i,j,k) = A(i,j,k) + B(i,j,k)

using Python on my GPU. This is my GPU:

http://www.geforce.com/hardware/desktop-gpus/geforce-gt-520/specifications

The sources I'm looking at / comparing with are:

http://nbviewer.ipython.org/gist/harrism/f5707335f40af9463c43

http://nbviewer.ipython.org/github/ContinuumIO/numbapro-examples/blob/master/webinars/2014_06_17/intro_to_gpu_python.ipynb

It's possible that I've imported more modules than necessary. This is my code:

import numpy as np
import numbapro
import numba
import math
from timeit import default_timer as timer
from numbapro import cuda
from numba import *

@autojit
def myAdd(a, b):
  return a+b

myAdd_gpu = cuda.jit(restype=f8, argtypes=[f8, f8], device=True)(myAdd)

@cuda.jit(argtypes=[float32[:,:,:], float32[:,:,:], float32[:,:,:]])
def myAdd_kernel(a, b, c):
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    tz = cuda.threadIdx.z
    bx = cuda.blockIdx.x
    by = cuda.blockIdx.y
    bz = cuda.blockIdx.z
    bw = cuda.blockDim.x
    bh = cuda.blockDim.y
    bd = cuda.blockDim.z
    i = tx + bx * bw
    j = ty + by * bh
    k = tz + bz * bd
    if i >= c.shape[0]:
      return
    if j >= c.shape[1]:
      return
    if k >= c.shape[2]:
      return
    for i in xrange(0,c.shape[0]):
      for j in xrange(0,c.shape[1]):
        for k in xrange(0,c.shape[2]):
          # c[i,j,k] = a[i,j,k] + b[i,j,k]
          c[i,j,k] = myAdd_gpu(a[i,j,k],b[i,j,k])

def main():
    my_gpu = numba.cuda.get_current_device()
    print "Running on GPU:", my_gpu.name
    cores_per_capability = {1: 8,2: 32,3: 192,}
    cc = my_gpu.compute_capability
    print "Compute capability: ", "%d.%d" % cc, "(Numba requires >= 2.0)"
    majorcc = cc[0]
    print "Number of streaming multiprocessor:", my_gpu.MULTIPROCESSOR_COUNT
    cores_per_multiprocessor = cores_per_capability[majorcc]
    print "Number of cores per multiprocessor:", cores_per_multiprocessor
    total_cores = cores_per_multiprocessor * my_gpu.MULTIPROCESSOR_COUNT
    print "Number of cores on GPU:", total_cores

    N = 100
    thread_ct = my_gpu.WARP_SIZE
    block_ct = int(math.ceil(float(N) / thread_ct))

    print "Threads per block:", thread_ct
    print "Blocks per grid:", block_ct

    a = np.ones((N,N,N), dtype = np.float32)
    b = np.ones((N,N,N), dtype = np.float32)
    c = np.zeros((N,N,N), dtype = np.float32)

    start = timer()
    cg = cuda.to_device(c)
    myAdd_kernel[block_ct, thread_ct](a,b,cg)
    cg.to_host()
    dt = timer() - start
    print "Wall clock time with GPU in %f s" % dt
    print 'c[:3,1,1] = ' + str(c[:3,1,1])
    print 'c[-3:,1,1] = ' + str(c[-3:,1,1])


if __name__ == '__main__':
    main()

My result from running this is the following:

Running on GPU: GeForce GT 520
Compute capability:  2.1 (Numba requires >= 2.0)
Number of streaming multiprocessor: 1
Number of cores per multiprocessor: 32
Number of cores on GPU: 32
Threads per block: 32
Blocks per grid: 4
Wall clock time with GPU in 1.104860 s
c[:3,1,1] = [ 2.  2.  2.]
c[-3:,1,1] = [ 2.  2.  2.]

When I run the examples in the sources, I see significant speedup. I don't think my example is running properly since the wall clock time is much longer than I would expect. I've modeled this mostly from the "even bigger speedups with cuda python" section in the first example link.

I believe I've indexed correctly and safely. Maybe the problem is with my blockdim or griddim, or maybe I'm using the wrong types for my GPU — I think I read that they must be a certain type. I'm very new to this, so the problem could well be trivial!

Any and all help is greatly appreciated!

asked Oct 20 '22 by Charles

1 Answer

You are creating your indexes correctly but then you're ignoring them. Running the nested loop

for i in xrange(0,c.shape[0]):
    for j in xrange(0,c.shape[1]):
        for k in xrange(0,c.shape[2]):

forces every thread to loop over every element in all three dimensions, so each element of c is computed once per active thread instead of once in total. You want each thread to compute one value (or a small strided subset) and then stop.
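To quantify the redundancy with the launch configuration from the question (4 blocks of 32 threads in x, N = 100): every thread whose x index passes the bounds check goes on to run the full triple loop, so each of the N³ elements of c is written 100 times. A quick pure-Python count (thread and block indices are plain loop variables here, no GPU involved):

```python
N = 100
THREADS_PER_BLOCK, BLOCKS = 32, 4

# Simulate the 1-D launch: each "thread" gets a global x index i.
active = 0
for bx in range(BLOCKS):
    for tx in range(THREADS_PER_BLOCK):
        i = tx + bx * THREADS_PER_BLOCK
        if i >= N:       # the kernel's early return
            continue
        active += 1

# Every active thread then runs the full N**3 triple loop,
# so each of the N**3 elements of c is written `active` times.
print(active)            # 100 threads survive the bounds check
print(active * N ** 3)   # additions actually performed
print(N ** 3)            # additions needed
```

That is roughly 100x more arithmetic and memory traffic than the problem requires, which goes a long way toward explaining the long wall-clock time (JIT compilation on the first call adds to it as well).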

I think something like this should work better...

i = tx + bx * bw
while i < c.shape[0]:
    j = ty + by * bh
    while j < c.shape[1]:
        k = tz + bz * bd
        while k < c.shape[2]:
            c[i,j,k] = myAdd_gpu(a[i,j,k], b[i,j,k])
            k += cuda.blockDim.z * cuda.gridDim.z  # stride ahead by the whole grid in z
        j += cuda.blockDim.y * cuda.gridDim.y      # stride ahead by the whole grid in y
    i += cuda.blockDim.x * cuda.gridDim.x          # stride ahead by the whole grid in x
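As a sanity check on the grid-stride pattern itself (no GPU needed), here is a pure-Python model of one dimension, with thread and block indices as plain loop variables and small made-up sizes: every index is visited exactly once, no matter how the element count relates to the thread count. The same argument applies independently in y and z.

```python
def grid_stride_indices(tx, bx, block_dim, grid_dim, n):
    """Indices that one simulated thread visits in a grid-stride loop."""
    i = tx + bx * block_dim        # this thread's starting index
    stride = block_dim * grid_dim  # total number of threads in the grid
    visited = []
    while i < n:                   # same shape as the while loops above
        visited.append(i)
        i += stride
    return visited

N, BLOCK, GRID = 10, 4, 2          # 8 threads cover 10 elements
covered = []
for bx in range(GRID):
    for tx in range(BLOCK):
        covered.extend(grid_stride_indices(tx, bx, BLOCK, GRID, N))

print(sorted(covered))             # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Each index appears exactly once, so no element is skipped or written twice.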

Try to compile and run it. Also make sure to validate it, as I have not.
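One more thing worth noting: the question launches a 1-D grid (scalar block_ct and thread_ct), so ty, tz, by, and bz are always zero. The grid-stride kernel still produces correct results that way, but a 3-D launch spreads the work across dimensions. Numba accepts tuples for the launch configuration; a sketch of how the sizes could be computed (the block shape here is my own choice, not from the question):

```python
import math

shape = (100, 100, 100)            # array shape from the question
threads_per_block = (8, 8, 4)      # 256 threads per block (illustrative choice)
blocks_per_grid = tuple(math.ceil(n / t)
                        for n, t in zip(shape, threads_per_block))
print(blocks_per_grid)             # (13, 13, 25)
# the launch would then be:
# myAdd_kernel[blocks_per_grid, threads_per_block](a, b, cg)
```

With a grid-stride kernel the exact sizes only affect performance, not correctness, since any leftover elements are picked up by the stride loops.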

answered Oct 22 '22 by Christian Sarofeen