Today I started working with CUDA and GPU processing. I found this tutorial: https://www.geeksforgeeks.org/running-python-script-on-gpu/
Unfortunately my first attempt to run gpu code failed:
from numba import jit, cuda
import numpy as np
# to measure exec time
from timeit import default_timer as timer

# normal function to run on cpu
def func(a):
    for i in range(10000000):
        a[i] += 1

# function optimized to run on gpu
@jit(target="cuda")
def func2(a):
    for i in range(10000000):
        a[i] += 1

if __name__ == "__main__":
    n = 10000000
    a = np.ones(n, dtype=np.float64)
    b = np.ones(n, dtype=np.float32)

    start = timer()
    func(a)
    print("without GPU:", timer() - start)

    start = timer()
    func2(a)
    print("with GPU:", timer() - start)
Output:

/home/amu/anaconda3/bin/python /home/amu/PycharmProjects/gpu_processing_base/gpu_base_1.py
without GPU: 4.89985659904778
Traceback (most recent call last):
  File "/home/amu/PycharmProjects/gpu_processing_base/gpu_base_1.py", line 30, in <module>
    func2(a)
  File "/home/amu/anaconda3/lib/python3.7/site-packages/numba/cuda/dispatcher.py", line 40, in __call__
    return self.compiled(*args, **kws)
  File "/home/amu/anaconda3/lib/python3.7/site-packages/numba/cuda/compiler.py", line 758, in __call__
    kernel = self.specialize(*args)
  File "/home/amu/anaconda3/lib/python3.7/site-packages/numba/cuda/compiler.py", line 769, in specialize
    kernel = self.compile(argtypes)
  File "/home/amu/anaconda3/lib/python3.7/site-packages/numba/cuda/compiler.py", line 785, in compile
    **self.targetoptions)
  File "/home/amu/anaconda3/lib/python3.7/site-packages/numba/core/compiler_lock.py", line 32, in _acquire_compile_lock
    return func(*args, **kwargs)
TypeError: compile_kernel() got an unexpected keyword argument 'boundscheck'

Process finished with exit code 1
I have installed the numba and cudatoolkit packages mentioned in the tutorial in an Anaconda environment in PyCharm.
Adding an answer to get this off the unanswered queue.
The code in that example is broken. There is nothing wrong with your numba or CUDA installation. There is no way that the code in your question (or the blog you copied it from) can produce the result the blog post claims.
There are many ways this could potentially be modified to work. One would be like this:
from numba import vectorize, jit, cuda
import numpy as np
# to measure exec time
from timeit import default_timer as timer

# normal function to run on cpu
def func(a):
    for i in range(10000000):
        a[i] += 1

# function optimized to run on gpu
@vectorize(['float64(float64)'], target="cuda")
def func2(x):
    return x + 1

if __name__ == "__main__":
    n = 10000000
    a = np.ones(n, dtype=np.float64)

    start = timer()
    func(a)
    print("without GPU:", timer() - start)

    start = timer()
    func2(a)
    print("with GPU:", timer() - start)
Here func2 becomes a ufunc which is compiled for the device and then run over the whole input array on the GPU. Running it gives:
$ python bogoexample.py
without GPU: 4.314514834433794
with GPU: 0.21419800259172916
So it is faster, but keep in mind that the GPU time includes the time taken to compile the GPU ufunc.
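One way to take both the compilation overhead and the host-to-device copies out of the measurement is to warm the ufunc up once and then call it on a device array. A minimal sketch of that idea, reusing the same func2 as above (the variable names here are just illustrative):

from numba import vectorize, cuda
import numpy as np
from timeit import default_timer as timer

# same ufunc as above, compiled for the CUDA target
@vectorize(['float64(float64)'], target="cuda")
def func2(x):
    return x + 1

n = 10000000
a = np.ones(n, dtype=np.float64)

func2(a)                       # warm-up call so JIT compilation is not included in the timing below

d_a = cuda.to_device(a)        # copy the input to the GPU once
start = timer()
d_out = func2(d_a)             # a CUDA ufunc called on a device array returns a device array
cuda.synchronize()             # wait for the GPU to finish before reading the timer
print("ufunc on device array:", timer() - start)

result = d_out.copy_to_host()  # copy the result back only when it is actually needed

Because the ufunc returns a device array when given one, nothing is copied back to the host until copy_to_host() is called.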
Another alternative would be to actually write a GPU kernel. Like this:
from numba import vectorize, jit, cuda
import numpy as np
# to measure exec time
from timeit import default_timer as timer

# normal function to run on cpu
def func(a):
    for i in range(10000000):
        a[i] += 1

# function optimized to run on gpu
@vectorize(['float64(float64)'], target="cuda")
def func2(x):
    return x + 1

# kernel to run on gpu
@cuda.jit
def func3(a, N):
    tid = cuda.grid(1)
    if tid < N:
        a[tid] += 1

if __name__ == "__main__":
    n = 10000000
    a = np.ones(n, dtype=np.float64)

    for i in range(0, 5):
        start = timer()
        func(a)
        print(i, " without GPU:", timer() - start)

    for i in range(0, 5):
        start = timer()
        func2(a)
        print(i, " with GPU ufunc:", timer() - start)

    threadsperblock = 1024
    blockspergrid = (a.size + (threadsperblock - 1)) // threadsperblock
    for i in range(0, 5):
        start = timer()
        func3[blockspergrid, threadsperblock](a, n)
        print(i, " with GPU kernel:", timer() - start)
which runs like this:
$ python bogoexample.py
0 without GPU: 4.885275377891958
1 without GPU: 4.748716968111694
2 without GPU: 4.902181145735085
3 without GPU: 4.889955999329686
4 without GPU: 4.881594380363822
0 with GPU ufunc: 0.16726416163146496
1 with GPU ufunc: 0.03758022002875805
2 with GPU ufunc: 0.03580896370112896
3 with GPU ufunc: 0.03530424740165472
4 with GPU ufunc: 0.03579768259078264
0 with GPU kernel: 0.1421878095716238
1 with GPU kernel: 0.04386183246970177
2 with GPU kernel: 0.029975440353155136
3 with GPU kernel: 0.029602501541376114
4 with GPU kernel: 0.029780613258481026
Here you can see that the kernel runs slightly faster than the ufunc, and that caching (and this is caching of the JIT compiled functions, not memoization of the calls) significantly speeds up the call on the GPU.
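For reference, cuda.grid(1) in func3 is just shorthand for the usual global-thread-index computation. A minimal sketch of the same kernel with the indexing spelled out explicitly (func3_explicit is an illustrative name, not part of the original code):

from numba import cuda
import numpy as np

# the same kernel as func3, but with the global thread index
# that cuda.grid(1) computes written out explicitly
@cuda.jit
def func3_explicit(a, N):
    tid = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
    if tid < N:            # the last block may contain threads past the end of the array
        a[tid] += 1

n = 10000000
a = np.ones(n, dtype=np.float64)
threadsperblock = 1024
blockspergrid = (n + threadsperblock - 1) // threadsperblock
func3_explicit[blockspergrid, threadsperblock](a, n)
print(a[:3])               # expect [2. 2. 2.]

The guard tid < N is needed because the grid is rounded up to whole blocks, so the last block can contain threads that fall past the end of the array.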