I took a program that draws a Mandelbrot plot and made it run on a single CPU thread using njit. Now I want to generate a 32k image, but even a whole thread is too slow, so I tried to make the code run on a GPU. Here is the code:
from numba import njit, cuda, vectorize
from PIL import Image, ImageDraw

@vectorize(['complex128(complex128)'], target='cuda')
def mandelbrot(c):
    z = 0
    n = 0
    while abs(z) <= 2 and n < 80:
        z = z*z + c
        n += 1
    return n
def vari(WIDTH, HEIGHT, RE_START, RE_END, IM_START, IM_END, draw):
    for x in range(0, WIDTH):
        for y in range(0, HEIGHT):
            print(x)
            # Convert pixel coordinate to complex number
            c = complex(RE_START + (x / WIDTH) * (RE_END - RE_START),
                        IM_START + (y / HEIGHT) * (IM_END - IM_START))
            # Compute the number of iterations
            m = mandelbrot(c)
            # The color depends on the number of iterations
            color = 255 - int(m * 255 / 80)
            # Plot the point
            draw.point([x, y], (color, color, color))
def vai():
    # Image size (pixels)
    WIDTH = 15360
    HEIGHT = 8640
    # Plot window
    RE_START = -2
    RE_END = 1
    IM_START = -1
    IM_END = 1
    palette = []
    im = Image.new('RGB', (WIDTH, HEIGHT), (0, 0, 0))
    draw = ImageDraw.Draw(im)
    vari(WIDTH, HEIGHT, RE_START, RE_END, IM_START, IM_END, draw)
    im.save('output.png', 'PNG')

vai()
And here is the error:
D:\anaconda\python.exe C:/Users/techguy/PycharmProjects/mandelbrot/main.py
0
Traceback (most recent call last):
File "C:/Users/techguy/PycharmProjects/mandelbrot/main.py", line 56, in <module>
vai()
File "C:/Users/techguy/PycharmProjects/mandelbrot/main.py", line 52, in vai
vari(WIDTH, HEIGHT, RE_START, RE_END, IM_START, IM_END, draw )
File "C:/Users/techguy/PycharmProjects/mandelbrot/main.py", line 30, in vari
m = mandelbrot(c)
File "D:\anaconda\lib\site-packages\numba\cuda\dispatcher.py", line 41, in __call__
return CUDAUFuncMechanism.call(self.functions, args, kws)
File "D:\anaconda\lib\site-packages\numba\np\ufunc\deviceufunc.py", line 301, in call
cr.launch(func, shape[0], stream, devarys)
File "D:\anaconda\lib\site-packages\numba\cuda\dispatcher.py", line 152, in launch
func.forall(count, stream=stream)(*args)
File "D:\anaconda\lib\site-packages\numba\cuda\compiler.py", line 372, in __call__
kernel = self.kernel.specialize(*args)
File "D:\anaconda\lib\site-packages\numba\cuda\compiler.py", line 881, in specialize
specialization = Dispatcher(self.py_func, [types.void(*argtypes)],
File "D:\anaconda\lib\site-packages\numba\cuda\compiler.py", line 808, in __init__
self.compile(sigs[0])
File "D:\anaconda\lib\site-packages\numba\cuda\compiler.py", line 935, in compile
kernel.bind()
File "D:\anaconda\lib\site-packages\numba\cuda\compiler.py", line 576, in bind
self._func.get()
File "D:\anaconda\lib\site-packages\numba\cuda\compiler.py", line 446, in get
ptx = self.ptx.get()
File "D:\anaconda\lib\site-packages\numba\cuda\compiler.py", line 414, in get
arch = nvvm.get_arch_option(*cc)
File "D:\anaconda\lib\site-packages\numba\cuda\cudadrv\nvvm.py", line 345, in get_arch_option
return 'compute_%d%d' % arch
TypeError: not enough arguments for format string
Process finished with exit code 1
If I substitute @vectorize with @njit(nogil=True), it works fine, but it runs on the CPU. I absolutely need it to run on the GPU. I think the problem is something to do with the complex type.
What is the problem?
The code is not mine: I found it in "How to plot the Mandelbrot set" and just modified some pieces.
Here is a minimal reproducible example:
from numba import cuda, vectorize

@vectorize(['int32(complex128)'], target='cuda')
def mandelbrot(c):
    z = 0
    n = 0
    while abs(z) <= 2 and n < 80:
        z = z*z + c
        n += 1
    return n

comple = complex(10, 12)
print(mandelbrot(comple))
Numba supports CUDA GPU programming by directly compiling a restricted subset of Python code into CUDA kernels and device functions following the CUDA execution model. Kernels written in Numba appear to have direct access to NumPy arrays. NumPy arrays are transferred between the CPU and the GPU automatically.
Numba's vectorize allows Python functions taking scalar input arguments to be used as NumPy ufuncs. Creating a traditional NumPy ufunc is not the most straightforward process and involves writing some C code. Numba makes this easy.
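To illustrate the scalar-function-to-ufunc idea without requiring a GPU or numba at all, here is a minimal sketch using NumPy's own np.vectorize (a pure-Python wrapper, not compiled like numba's decorator; the function and variable names are mine):

```python
import numpy as np

# Scalar function: escape-time iteration count for one complex value.
def mandelbrot_scalar(c, max_iter=80):
    z = 0
    n = 0
    while abs(z) <= 2 and n < max_iter:
        z = z * z + c
        n += 1
    return n

# np.vectorize lifts the scalar function so it maps element-wise over
# arrays, which is the same calling convention numba's @vectorize gives you.
mandelbrot_ufunc = np.vectorize(mandelbrot_scalar)

cs = np.array([0 + 0j, 2 + 2j, -1 + 0j])
print(mandelbrot_ufunc(cs))  # one iteration count per input element
```

The key point is that the decorated function is meant to be handed a whole array of inputs in one call, not a single scalar per call.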
The CUDA JIT is a low-level entry point to the CUDA features in Numba. It translates Python functions into PTX code which execute on the CUDA hardware. The jit decorator is applied to Python functions written in our Python dialect for CUDA.
You demonstrate a lack of very basic understanding of what vectorize even does, let alone CUDA. Before you even look at this answer, you should read up here: https://numba.pydata.org/numba-doc/dev/user/vectorize.html
You seem to be missing basic information, like what "vectorize" implies in general, outside of a numba context. "Vector" implies we are running a SIMD-style operation on some array (a vector) of input. Look at your code:
@vectorize(['complex128(complex128)'], target='cuda')
def mandelbrot(c):
    z = 0
    n = 0
    while abs(z) <= 2 and n < 80:
        z = z*z + c
        n += 1
    return n
When you add that decorator, you're converting this function into a vectorized version. Without the decorator, it takes a scalar value, i.e. a single complex number. After conversion, mandelbrot will expect a vector of values, so that each can be processed in parallel. So, can you spot the massive misuse of the function you've just created here?
def vari(WIDTH, HEIGHT, RE_START, RE_END, IM_START, IM_END, draw):
    for x in range(0, WIDTH):
        for y in range(0, HEIGHT):
            print(x)
            # Convert pixel coordinate to complex number
            c = complex(RE_START + (x / WIDTH) * (RE_END - RE_START),
                        IM_START + (y / HEIGHT) * (IM_END - IM_START))
            # Compute the number of iterations
            m = mandelbrot(c)
            # The color depends on the number of iterations
            color = 255 - int(m * 255 / 80)
            # Plot the point
            draw.point([x, y], (color, color, color))
Your mandelbrot function is operating on scalar values inside a loop. In other words, you're using your vectorized function incorrectly, and in the worst way possible. Look at this converted code:
import numpy as np

def vari(WIDTH, HEIGHT, RE_START, RE_END, IM_START, IM_END, draw):
    complex_mat = np.empty((HEIGHT, WIDTH), dtype=np.complex128)
    for x in range(0, WIDTH):
        for y in range(0, HEIGHT):
            # Convert pixel coordinate to complex number
            c = complex(RE_START + (x / WIDTH) * (RE_END - RE_START),
                        IM_START + (y / HEIGHT) * (IM_END - IM_START))
            complex_mat[y, x] = c
    # Compute the number of iterations for every pixel in one call
    m = mandelbrot(complex_mat)
    for x in range(0, WIDTH):
        for y in range(0, HEIGHT):
            # The color depends on the number of iterations
            color = 255 - int(m[y, x] * 255 / 80)
            # Plot the point
            draw.point([x, y], (color, color, color))
We first create the "vector" to be fed into the "vectorized" function; in this case any NumPy array will do, and the function is applied element-wise, producing an output of the same shape.
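Incidentally, the coordinate grid itself doesn't need the double Python loop either; a sketch of the same pixel-to-complex mapping built with NumPy broadcasting (the function name is mine, not from the code above):

```python
import numpy as np

def make_complex_grid(width, height, re_start, re_end, im_start, im_end):
    # Real parts vary along columns, imaginary parts along rows;
    # broadcasting combines them into one (height, width) complex matrix,
    # matching c = RE_START + (x/WIDTH)*(RE_END-RE_START), etc.
    re = re_start + (np.arange(width) / width) * (re_end - re_start)
    im = im_start + (np.arange(height) / height) * (im_end - im_start)
    return re[np.newaxis, :] + 1j * im[:, np.newaxis]

grid = make_complex_grid(4, 3, -2.0, 1.0, -1.0, 1.0)
print(grid.shape)  # (3, 4)
```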
Now, you'll still see that this code is slow. Again, there is another very basic gap in understanding here that shows a lack of prior research. I suggest you benchmark this code, and that you do so before you ever come to SO asking for suggestions on how to improve the speed. You'll probably find it isn't even the mandelbrot code itself that causes the slowdown: everything else you've done is still serialized. You'll want to move your complex-number generation, your mandelbrot computation, and your point plotting onto the GPU. I'm unsure of how to do this with numba, and it is far beyond the scope of your question anyway, but this may be of use:
https://github.com/numba/numba/issues/4309
it appears you will want to use the built-in CUDA parallelization facilities instead of vectorize, so that you don't have to pass useless data to the GPU (i.e. so you can iterate directly over the pixels you need to generate values for, rather than passing the pixel indexes to CUDA).
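To make the serialization point concrete, here is a hedged CPU-only sketch of what a fully array-based escape-time computation looks like, with no per-pixel Python loop at all (plain NumPy, not CUDA; the function name is mine, and a GPU kernel version would follow the same per-element logic):

```python
import numpy as np

def mandelbrot_matrix(c, max_iter=80):
    # c: complex array of any shape; returns per-element iteration counts,
    # mirroring "while abs(z) <= 2 and n < max_iter" from the scalar version.
    z = np.zeros_like(c)
    n = np.zeros(c.shape, dtype=np.int32)
    for _ in range(max_iter):
        mask = np.abs(z) <= 2      # points still iterating
        if not mask.any():
            break                  # everything has escaped
        z[mask] = z[mask] * z[mask] + c[mask]
        n[mask] += 1
    return n
```

The masking keeps already-escaped points from being updated further, so the counts match the scalar loop exactly; only the iteration limit remains a serial loop.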
An additional reason the code is slow, beyond shuttling massive amounts of data back and forth between CPU and GPU, is the use of complex128. GPUs do not always have fast double precision; in particular, Nvidia tends to cripple double-precision performance on consumer GPUs, to the point where it can run at 1/32 the speed of single precision. This is relevant because complex128 is really two double-precision values stuck together; complex64 may provide better speed. The drawback of lower precision, which you are unlikely to hit in this experiment, is that as you zoom far into the Mandelbrot set you may see precision artifacts. There are techniques to fix this by seamlessly "wrapping" the function that calculates the set, but that is out of scope for this question.
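The storage half of that trade-off is easy to check; a small sketch of the size difference and the rounding error that complex64 introduces:

```python
import numpy as np

c128 = np.array([0.1 + 0.1j], dtype=np.complex128)
c64 = c128.astype(np.complex64)

# complex128 stores two 64-bit floats, complex64 two 32-bit floats.
print(c128.itemsize, c64.itemsize)  # 16 8

# Casting down to complex64 leaves a tiny rounding error per component
# (~7 significant decimal digits instead of ~15), which is what eventually
# shows up as artifacts in deep zooms.
print(abs(complex(c64[0]).real - 0.1))
```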
Finally, when I ran my modified code, it worked fine. In other words, I did not get
File "D:\anaconda\lib\site-packages\numba\cuda\cudadrv\nvvm.py", line 345, in get_arch_option
return 'compute_%d%d' % arch
TypeError: not enough arguments for format string
errors. If you still get this error while running my modified version, then you have some other configuration problem, which is far too broad and out of scope for this question given the lack of research. For example, it could be something as basic as "did you install CUDA?", but we have no way of knowing without a more focused question. Here is the output I generated (smaller, so that it would fit within the size requirements on SO). Note that I did not replace
@vectorize(['complex128(complex128)'], target='cuda')
with
@vectorize(['int32(complex128)'], target='cuda')
and that this is not an appropriate solution to your problem. Again, this points to some user-specific configuration error.