
Vectorizing for cuda, a function that takes a complex number as input and a complex number as output fails in numba [closed]

I took a program that makes a mandelbrot plot and made it run on a CPU thread using njit. Now I want to generate a 32k image but even a whole thread is too slow. So I tried to make the code run on a GPU. Here is the code:

from numba import njit, cuda, vectorize
from PIL import Image, ImageDraw


@vectorize(['complex128(complex128)'], target='cuda')
def mandelbrot(c):

    z = 0
    n = 0
    while abs(z) <= 2 and n < 80:
        z = z*z + c
        n += 1
    return n


def vari(WIDTH, HEIGHT, RE_START, RE_END, IM_START, IM_END, draw):

    for x in range(0, WIDTH):

        for y in range(0, HEIGHT):

            print(x)
            # Convert pixel coordinate to complex number
            c = complex(RE_START + (x / WIDTH) * (RE_END - RE_START),
                        IM_START + (y / HEIGHT) * (IM_END - IM_START))
            # Compute the number of iterations
            m = mandelbrot(c)
            # The color depends on the number of iterations
            color = 255 - int(m * 255 / 80)
            # Plot the point
            draw.point([x, y], (color, color, color))


def vai():
    # Image size (pixels)
    WIDTH = 15360
    HEIGHT = 8640

    # Plot window
    RE_START = -2
    RE_END = 1
    IM_START = -1
    IM_END = 1

    palette = []

    im = Image.new('RGB', (WIDTH, HEIGHT), (0, 0, 0))
    draw = ImageDraw.Draw(im)
    vari(WIDTH, HEIGHT, RE_START, RE_END, IM_START, IM_END, draw )

    im.save('output.png', 'PNG')

vai()

And here is the error:

D:\anaconda\python.exe C:/Users/techguy/PycharmProjects/mandelbrot/main.py
0
Traceback (most recent call last):
  File "C:/Users/techguy/PycharmProjects/mandelbrot/main.py", line 56, in <module>
    vai()
  File "C:/Users/techguy/PycharmProjects/mandelbrot/main.py", line 52, in vai
    vari(WIDTH, HEIGHT, RE_START, RE_END, IM_START, IM_END, draw )
  File "C:/Users/techguy/PycharmProjects/mandelbrot/main.py", line 30, in vari
    m = mandelbrot(c)
  File "D:\anaconda\lib\site-packages\numba\cuda\dispatcher.py", line 41, in __call__
    return CUDAUFuncMechanism.call(self.functions, args, kws)
  File "D:\anaconda\lib\site-packages\numba\np\ufunc\deviceufunc.py", line 301, in call
    cr.launch(func, shape[0], stream, devarys)
  File "D:\anaconda\lib\site-packages\numba\cuda\dispatcher.py", line 152, in launch
    func.forall(count, stream=stream)(*args)
  File "D:\anaconda\lib\site-packages\numba\cuda\compiler.py", line 372, in __call__
    kernel = self.kernel.specialize(*args)
  File "D:\anaconda\lib\site-packages\numba\cuda\compiler.py", line 881, in specialize
    specialization = Dispatcher(self.py_func, [types.void(*argtypes)],
  File "D:\anaconda\lib\site-packages\numba\cuda\compiler.py", line 808, in __init__
    self.compile(sigs[0])
  File "D:\anaconda\lib\site-packages\numba\cuda\compiler.py", line 935, in compile
    kernel.bind()
  File "D:\anaconda\lib\site-packages\numba\cuda\compiler.py", line 576, in bind
    self._func.get()
  File "D:\anaconda\lib\site-packages\numba\cuda\compiler.py", line 446, in get
    ptx = self.ptx.get()
  File "D:\anaconda\lib\site-packages\numba\cuda\compiler.py", line 414, in get
    arch = nvvm.get_arch_option(*cc)
  File "D:\anaconda\lib\site-packages\numba\cuda\cudadrv\nvvm.py", line 345, in get_arch_option
    return 'compute_%d%d' % arch
TypeError: not enough arguments for format string

Process finished with exit code 1

If I substitute @vectorize with @njit(nogil=True) it works fine, but it runs on the CPU. I absolutely need it to run on the GPU. I suspect the problem has something to do with the complex type.
What is the problem?

The code is not mine: I found it at "How to plot the Mandelbrot set". I just modified some pieces.

Here is a minimal reproducible example:

from numba import cuda, vectorize

@vectorize(['int32(complex128)'], target='cuda')
def mandelbrot(c):

    z = 0
    n = 0
    while abs(z) <= 2 and n < 80:
        z = z*z + c
        n += 1
    return n

comple = complex(10, 12)
print(mandelbrot(comple))
asked Feb 23 '21 16:02 by TechTheGuy



1 Answer

You demonstrate a lack of very basic understanding of what vectorize even does, let alone CUDA. Before you even look at this answer, you should read up here: https://numba.pydata.org/numba-doc/dev/user/vectorize.html

You seem to be missing basic background, such as what "vectorize" implies in general, outside of a Numba context. "Vector" implies that we run the same operation, SIMD-style, across every element of some array (a.k.a. vector) input. Look at your code:

@vectorize(['complex128(complex128)'], target='cuda')
def mandelbrot(c):

    z = 0
    n = 0
    while abs(z) <= 2 and n < 80:
        z = z*z + c
        n += 1
    return n

When you add that decorator, you're converting this function into a vectorized version. Without the decorator it takes a scalar value, i.e. a single complex number. Once converted, mandelbrot expects a vector of values, so that each element can be processed in parallel. So can you spot the massive misuse of the function you've just created here?

def vari(WIDTH, HEIGHT, RE_START, RE_END, IM_START, IM_END, draw):

    for x in range(0, WIDTH):

        for y in range(0, HEIGHT):

            print(x)
            # Convert pixel coordinate to complex number
            c = complex(RE_START + (x / WIDTH) * (RE_END - RE_START),
                        IM_START + (y / HEIGHT) * (IM_END - IM_START))
            # Compute the number of iterations
            m = mandelbrot(c)
            # The color depends on the number of iterations
            color = 255 - int(m * 255 / 80)
            # Plot the point
            draw.point([x, y], (color, color, color))

Your mandelbrot function is operating on scalar values within a loop. In other words, you're using your vectorized function incorrectly, and in the worst way possible. Look at this converted code:

import numpy as np  # needed for np.empty and np.complex128 below


def vari(WIDTH, HEIGHT, RE_START, RE_END, IM_START, IM_END, draw):

    complex_mat = np.empty((HEIGHT, WIDTH), dtype=np.complex128)
    for x in range(0, WIDTH):
        for y in range(0, HEIGHT):
            print(x)
            # Convert pixel coordinate to complex number
            c = complex(RE_START + (x / WIDTH) * (RE_END - RE_START),
                        IM_START + (y / HEIGHT) * (IM_END - IM_START))
            complex_mat[y,x] = c


    # Compute the number of iterations
    m = mandelbrot(complex_mat)
    for x in range(0, WIDTH):
        for y in range(0, HEIGHT):
            # The color depends on the number of iterations
            color = 255 - int(m[y,x] * 255 / 80)
            # Plot the point
            draw.point([x, y], (color, color, color))

We first create the "vector" that is fed into the vectorized function. In this case any NumPy array will do: the function is applied element-wise, and the output has the same shape as the input.
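As a side note, the Python loops that fill the input array and turn the iteration counts into colors can themselves be expressed as NumPy array operations. The following is only a minimal sketch, assuming the same pixel-to-plane mapping as in the question; the helper names make_grid and to_gray are made up for illustration and are not part of NumPy or Numba.

```python
import numpy as np


def make_grid(width, height, re_start, re_end, im_start, im_end):
    """Return a (height, width) complex128 array with one point per pixel."""
    # endpoint=False reproduces RE_START + (x / WIDTH) * (RE_END - RE_START)
    re = np.linspace(re_start, re_end, width, endpoint=False)
    im = np.linspace(im_start, im_end, height, endpoint=False)
    # Broadcasting a row of real parts against a column of imaginary parts
    # gives grid[y, x] = re[x] + 1j * im[y].
    return re[np.newaxis, :] + 1j * im[:, np.newaxis]


def to_gray(counts, max_iter=80):
    """Map iteration counts to gray levels, like 255 - int(m * 255 / 80)."""
    # .real also covers the case where the ufunc returned complex values
    counts = np.asarray(counts).real.astype(np.int64)
    return (255 - (counts * 255) // max_iter).astype(np.uint8)
```

With helpers like these, the body of vari could shrink to something like m = mandelbrot(make_grid(...)) followed by Image.fromarray(to_gray(m)), which produces a grayscale image rather than the RGB one from the question.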

Now you'll find that this code is still slow. Again, there's another very basic gap in understanding here that shows a lack of prior research. I suggest you benchmark this code, and that you do so before you ever come to SO asking how to improve the speed. You'll probably find that it isn't even the mandelbrot code itself that causes the slowdown. Everything else you've done is still serialized. You'll want to move the complex number generation, the mandelbrot computation and the point generation onto the GPU. I'm unsure how to do this with Numba, and it is far beyond the scope of your question anyway, but this may be of use:

https://github.com/numba/numba/issues/4309

It appears you will want to use the built-in CUDA parallelization facilities instead of vectorize, so that you don't have to pass useless data to the GPU (i.e. so you can iterate over the pixels you need to generate values for, rather than passing the pixel indexes to CUDA).

An additional reason why the code is slow, beyond the passing of massive amounts of data back and forth between CPU and GPU, is the use of complex128. GPUs sometimes do not have "fast" double precision; in particular, Nvidia tends to cripple double-precision performance on consumer GPUs, to the point where double precision can run at 1/32 the speed of single precision. This is relevant because complex128 is really two double-precision values stuck together; complex64 (two single-precision values) may provide better speed. The problem with lower precision, which you are unlikely to hit in this experiment, is that precision errors appear when you zoom far into the Mandelbrot set. There are techniques that avoid these artifacts by seamlessly "wrapping" the function that calculates the set, but that is out of scope for this question.

Finally, when I ran my modified code it worked fine. In other words, I did not get
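To make the size and precision difference concrete, here is a small CPU-only sketch. The helper names and the two sample points are made up for illustration; the points stand for neighbouring pixels at a deep zoom level, 1e-9 apart.

```python
import numpy as np


def bytes_per_value(dtype):
    """Storage size of one element of the given NumPy dtype."""
    return np.dtype(dtype).itemsize


def distinguishable(a, b, dtype):
    """True if a and b are still different after conversion to dtype."""
    return bool(dtype(a) != dtype(b))


# Two sample points whose real parts differ by only 1e-9
a = complex(-0.75, 0.1)
b = complex(-0.75 + 1e-9, 0.1)

print(bytes_per_value(np.complex128), bytes_per_value(np.complex64))
print(distinguishable(a, b, np.complex128), distinguishable(a, b, np.complex64))
```

The two points stay distinct as complex128 but collapse into the same value as complex64. At the plot window used in the question the pixel spacing is about 3 / 15360, roughly 2e-4, which is far coarser than single-precision resolution, so this only matters at much deeper zoom levels.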

  File "D:\anaconda\lib\site-packages\numba\cuda\cudadrv\nvvm.py", line 345, in get_arch_option
    return 'compute_%d%d' % arch
TypeError: not enough arguments for format string

errors. If you still get this error while running my modified version, then you have some other configuration problem, which is far too broad and out of scope for this question given the lack of research. For example, it could be as basic as "did you install CUDA?", but we have no way of knowing without a more focused question. Here is the output I generated (smaller, so that it would fit within the size requirements on SO). Note that I did not replace

@vectorize(['complex128(complex128)'], target='cuda')

with

@vectorize(['int32(complex128)'], target='cuda')

and that is not an appropriate solution to your problem. Again, this points to some user-specific configuration error.

[Image: the generated Mandelbrot plot]

answered Sep 27 '22 21:09 by Krupip