I am trying to set up a 3D loop with the assignment
C(i,j,k) = A(i,j,k) + B(i,j,k)
using Python on my GPU. This is my GPU:
http://www.geforce.com/hardware/desktop-gpus/geforce-gt-520/specifications
The sources I'm looking at / comparing with are:
http://nbviewer.ipython.org/gist/harrism/f5707335f40af9463c43
http://nbviewer.ipython.org/github/ContinuumIO/numbapro-examples/blob/master/webinars/2014_06_17/intro_to_gpu_python.ipynb
It's possible that I've imported more modules than necessary. This is my code:
import numpy as np
import numbapro
import numba
import math
from timeit import default_timer as timer
from numbapro import cuda
from numba import *
@autojit
def myAdd(a, b):
    return a+b
myAdd_gpu = cuda.jit(restype=f8, argtypes=[f8, f8], device=True)(myAdd)
@cuda.jit(argtypes=[float32[:,:,:], float32[:,:,:], float32[:,:,:]])
def myAdd_kernel(a, b, c):
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    tz = cuda.threadIdx.z
    bx = cuda.blockIdx.x
    by = cuda.blockIdx.y
    bz = cuda.blockIdx.z
    bw = cuda.blockDim.x
    bh = cuda.blockDim.y
    bd = cuda.blockDim.z
    i = tx + bx * bw
    j = ty + by * bh
    k = tz + bz * bd
    if i >= c.shape[0]:
        return
    if j >= c.shape[1]:
        return
    if k >= c.shape[2]:
        return
    for i in xrange(0,c.shape[0]):
        for j in xrange(0,c.shape[1]):
            for k in xrange(0,c.shape[2]):
                # c[i,j,k] = a[i,j,k] + b[i,j,k]
                c[i,j,k] = myAdd_gpu(a[i,j,k],b[i,j,k])
def main():
    my_gpu = numba.cuda.get_current_device()
    print "Running on GPU:", my_gpu.name
    cores_per_capability = {1: 8,2: 32,3: 192,}
    cc = my_gpu.compute_capability
    print "Compute capability: ", "%d.%d" % cc, "(Numba requires >= 2.0)"
    majorcc = cc[0]
    print "Number of streaming multiprocessor:", my_gpu.MULTIPROCESSOR_COUNT
    cores_per_multiprocessor = cores_per_capability[majorcc]
    print "Number of cores per mutliprocessor:", cores_per_multiprocessor
    total_cores = cores_per_multiprocessor * my_gpu.MULTIPROCESSOR_COUNT
    print "Number of cores on GPU:", total_cores
    N = 100
    thread_ct = my_gpu.WARP_SIZE
    block_ct = int(math.ceil(float(N) / thread_ct))
    print "Threads per block:", thread_ct
    print "Block per grid:", block_ct
    a = np.ones((N,N,N), dtype = np.float32)
    b = np.ones((N,N,N), dtype = np.float32)
    c = np.zeros((N,N,N), dtype = np.float32)
    start = timer()
    cg = cuda.to_device(c)
    myAdd_kernel[block_ct, thread_ct](a,b,cg)
    cg.to_host()
    dt = timer() - start
    print "Wall clock time with GPU in %f s" % dt
    print 'c[:3,:,:] = ' + str(c[:3,1,1])
    print 'c[-3:,:,:] = ' + str(c[-3:,1,1])
if __name__ == '__main__':
    main()
My result from running this is the following:
Running on GPU: GeForce GT 520
Compute capability: 2.1 (Numba requires >= 2.0)
Number of streaming multiprocessor: 1
Number of cores per mutliprocessor: 32
Number of cores on GPU: 32
Threads per block: 32
Block per grid: 4
Wall clock time with GPU in 1.104860 s
c[:3,:,:] = [ 2. 2. 2.]
c[-3:,:,:] = [ 2. 2. 2.]
When I run the examples in the sources, I see significant speedup. I don't think my example is running properly since the wall clock time is much longer than I would expect. I've modeled this mostly from the "even bigger speedups with cuda python" section in the first example link.
I believe I've indexed correctly and safely. Maybe the problem is with my blockdim or griddim? Or maybe I'm using the wrong types for my GPU; I think I read that the arguments must be of a certain type. I'm very new to this, so the problem could well be trivial!
Any and all help is greatly appreciated!
You are creating your indexes correctly but then you're ignoring them. Running the nested loop
for i in xrange(0,c.shape[0]):
    for j in xrange(0,c.shape[1]):
        for k in xrange(0,c.shape[2]):
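forces every thread to loop over every element in all three dimensions, which is not what you want. The loop variables also overwrite the i, j and k you just computed from the thread and block indices. You want each thread to compute only its own element(s) and then move on.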
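To put a rough number on that, here is a back-of-the-envelope sketch in plain Python (no GPU needed). It assumes the launch shown in the question, 4 blocks of 32 threads in a single dimension, so ty, tz, by and bz are all 0 and only the guard on i can make a thread return early. The variable names are mine, not from the question.

```python
# Rough work count for the kernel in the question (plain Python, no GPU).
# Assumption: the launch is myAdd_kernel[4, 32], i.e. one-dimensional, so
# j == k == 0 for every thread and only the guard on i can return early.
N = 100
blocks, threads_per_block = 4, 32

launched = blocks * threads_per_block                  # threads started
surviving = sum(1 for i in range(launched) if i < N)   # threads passing the guards
writes_per_thread = N ** 3                             # each runs the full triple loop
total_writes = surviving * writes_per_thread
needed_writes = N ** 3                                 # one write per element suffices
redundancy = total_writes // needed_writes

print("threads launched:        %d" % launched)
print("threads past the guards: %d" % surviving)
print("element writes done:     %d" % total_writes)
print("element writes needed:   %d" % needed_writes)
print("redundancy factor:       %d" % redundancy)
```

Under those assumptions every surviving thread redoes the whole array, so the kernel does the same work many times over instead of splitting it up.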
I think something like this should work better...
i = tx + bx * bw
while i < c.shape[0]:
    j = ty+by*bh
    while j < c.shape[1]:
        k = tz + bz * bd
        while k < c.shape[2]:
            c[i,j,k] = myAdd_gpu(a[i,j,k],b[i,j,k])
            k+=cuda.blockDim.z*cuda.gridDim.z
        j+=cuda.blockDim.y*cuda.gridDim.y
    i+=cuda.blockDim.x*cuda.gridDim.x
Try to compile and run it. Also make sure to validate it, as I have not.
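If you want to sanity-check the indexing before touching the GPU, one option is to simulate the threads on the CPU. The sketch below is plain Python, not Numba code: simulate_grid_stride is a hypothetical helper that runs the same three nested while loops once per simulated thread and counts how often each (i, j, k) gets written. The shapes and launch dimensions are small made-up values chosen so it runs quickly.

```python
# CPU-only check of the strided indexing pattern (no GPU or Numba needed).
# simulate_grid_stride is a hypothetical helper, not part of any library.
from itertools import product


def simulate_grid_stride(shape, grid_dim, block_dim):
    """Return {(i, j, k): write_count} after every simulated thread has
    run the three nested while loops with grid-sized strides."""
    writes = {}
    # Total number of threads along each axis = blockDim * gridDim
    stride = tuple(g * b for g, b in zip(grid_dim, block_dim))
    for bx, by, bz in product(*(range(g) for g in grid_dim)):
        for tx, ty, tz in product(*(range(b) for b in block_dim)):
            i = tx + bx * block_dim[0]
            while i < shape[0]:
                j = ty + by * block_dim[1]
                while j < shape[1]:
                    k = tz + bz * block_dim[2]
                    while k < shape[2]:
                        writes[(i, j, k)] = writes.get((i, j, k), 0) + 1
                        k += stride[2]
                    j += stride[1]
                i += stride[0]
    return writes


def written_exactly_once(shape, writes):
    """True if every index of the array was written one time, no more, no less."""
    all_indices = set(product(*(range(n) for n in shape)))
    return set(writes) == all_indices and all(v == 1 for v in writes.values())


shape = (10, 7, 5)

# A 3D launch that is smaller than the array, so the strides actually matter.
counts = simulate_grid_stride(shape, grid_dim=(2, 2, 1), block_dim=(4, 2, 2))
covered_once = written_exactly_once(shape, counts)
print("3D launch, each element written once: %s" % covered_once)

# A 1D launch shaped like the one in the question (4 blocks of 32 threads).
counts_1d = simulate_grid_stride(shape, grid_dim=(4, 1, 1), block_dim=(32, 1, 1))
covered_once_1d = written_exactly_once(shape, counts_1d)
print("1D launch, each element written once: %s" % covered_once_1d)
```

The second case suggests the strided loops stay correct even with the one-dimensional launch from the question, but then each active thread walks all of j and k serially. To spread the work over all three axes you would presumably launch with tuples for the grid and block dimensions, keeping the threads per block within what your card allows.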