I took a program that draws a Mandelbrot plot and made it run on a single CPU thread using njit. Now I want to generate a 32k image, but even a whole thread is too slow, so I tried to make the code run on a GPU. Here is the code:
from numba import njit, cuda, vectorize
from PIL import Image, ImageDraw

@vectorize(['complex128(complex128)'], target='cuda')
def mandelbrot(c):
    z = 0
    n = 0
    while abs(z) <= 2 and n < 80:
        z = z*z + c
        n += 1
    return n
def vari(WIDTH, HEIGHT, RE_START, RE_END, IM_START, IM_END, draw):
    for x in range(0, WIDTH):
        for y in range(0, HEIGHT):
            print(x)
            # Convert pixel coordinate to complex number
            c = complex(RE_START + (x / WIDTH) * (RE_END - RE_START),
                        IM_START + (y / HEIGHT) * (IM_END - IM_START))
            # Compute the number of iterations
            m = mandelbrot(c)
            # The color depends on the number of iterations
            color = 255 - int(m * 255 / 80)
            # Plot the point
            draw.point([x, y], (color, color, color))
def vai():
    # Image size (pixels)
    WIDTH = 15360
    HEIGHT = 8640
    # Plot window
    RE_START = -2
    RE_END = 1
    IM_START = -1
    IM_END = 1
    palette = []
    im = Image.new('RGB', (WIDTH, HEIGHT), (0, 0, 0))
    draw = ImageDraw.Draw(im)
    vari(WIDTH, HEIGHT, RE_START, RE_END, IM_START, IM_END, draw)
    im.save('output.png', 'PNG')

vai()
And here is the error:
D:\anaconda\python.exe C:/Users/techguy/PycharmProjects/mandelbrot/main.py
0
Traceback (most recent call last):
File "C:/Users/techguy/PycharmProjects/mandelbrot/main.py", line 56, in <module>
vai()
File "C:/Users/techguy/PycharmProjects/mandelbrot/main.py", line 52, in vai
vari(WIDTH, HEIGHT, RE_START, RE_END, IM_START, IM_END, draw )
File "C:/Users/techguy/PycharmProjects/mandelbrot/main.py", line 30, in vari
m = mandelbrot(c)
File "D:\anaconda\lib\site-packages\numba\cuda\dispatcher.py", line 41, in __call__
return CUDAUFuncMechanism.call(self.functions, args, kws)
File "D:\anaconda\lib\site-packages\numba\np\ufunc\deviceufunc.py", line 301, in call
cr.launch(func, shape[0], stream, devarys)
File "D:\anaconda\lib\site-packages\numba\cuda\dispatcher.py", line 152, in launch
func.forall(count, stream=stream)(*args)
File "D:\anaconda\lib\site-packages\numba\cuda\compiler.py", line 372, in __call__
kernel = self.kernel.specialize(*args)
File "D:\anaconda\lib\site-packages\numba\cuda\compiler.py", line 881, in specialize
specialization = Dispatcher(self.py_func, [types.void(*argtypes)],
File "D:\anaconda\lib\site-packages\numba\cuda\compiler.py", line 808, in __init__
self.compile(sigs[0])
File "D:\anaconda\lib\site-packages\numba\cuda\compiler.py", line 935, in compile
kernel.bind()
File "D:\anaconda\lib\site-packages\numba\cuda\compiler.py", line 576, in bind
self._func.get()
File "D:\anaconda\lib\site-packages\numba\cuda\compiler.py", line 446, in get
ptx = self.ptx.get()
File "D:\anaconda\lib\site-packages\numba\cuda\compiler.py", line 414, in get
arch = nvvm.get_arch_option(*cc)
File "D:\anaconda\lib\site-packages\numba\cuda\cudadrv\nvvm.py", line 345, in get_arch_option
return 'compute_%d%d' % arch
TypeError: not enough arguments for format string
Process finished with exit code 1
If I substitute @vectorize with @njit(nogil=True), it works fine, but it runs on the CPU. I absolutely need it to run on the GPU. I think the problem is something to do with the complex type.
What is the problem?
The code is not mine: I found it in "How to plot the Mandelbrot set" and just modified some pieces.
Here is a minimal reproducible example:
from numba import cuda, vectorize

@vectorize(['int32(complex128)'], target='cuda')
def mandelbrot(c):
    z = 0
    n = 0
    while abs(z) <= 2 and n < 80:
        z = z*z + c
        n += 1
    return n

comple = complex(10, 12)
print(mandelbrot(comple))
Numba supports CUDA GPU programming by directly compiling a restricted subset of Python code into CUDA kernels and device functions following the CUDA execution model. Kernels written in Numba appear to have direct access to NumPy arrays. NumPy arrays are transferred between the CPU and the GPU automatically.
Numba's vectorize allows Python functions taking scalar input arguments to be used as NumPy ufuncs. Creating a traditional NumPy ufunc is not the most straightforward process and involves writing some C code. Numba makes this easy.
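To illustrate the scalar-function-to-ufunc idea without requiring a GPU or numba at all, here is a minimal sketch using NumPy's own np.vectorize (a pure-Python wrapper, not compiled like numba's decorator; the function and variable names are mine):

```python
import numpy as np

# Scalar function: escape-time iteration count for one complex value.
def mandelbrot_scalar(c, max_iter=80):
    z = 0
    n = 0
    while abs(z) <= 2 and n < max_iter:
        z = z * z + c
        n += 1
    return n

# np.vectorize lifts the scalar function so it maps element-wise over
# arrays, which is the same calling convention numba's @vectorize gives you.
mandelbrot_ufunc = np.vectorize(mandelbrot_scalar)

cs = np.array([0 + 0j, 2 + 2j, -1 + 0j])
print(mandelbrot_ufunc(cs))  # one iteration count per input element
```

The key point is that the decorated function is meant to be handed a whole array of inputs in one call, not a single scalar per call.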
The CUDA JIT is a low-level entry point to the CUDA features in Numba. It translates Python functions into PTX code which execute on the CUDA hardware. The jit decorator is applied to Python functions written in our Python dialect for CUDA.
You demonstrate a lack of very basic understanding of what vectorize even does, let alone CUDA. Before you even look at this answer, you should read up here: https://numba.pydata.org/numba-doc/dev/user/vectorize.html
You seem to be missing basic information, like what "vectorize" implies in general, outside of a numba context. "Vector" implies we are running a SIMD-style operation on some array (a vector) of input. Look at your code:
@vectorize(['complex128(complex128)'], target='cuda')
def mandelbrot(c):
    z = 0
    n = 0
    while abs(z) <= 2 and n < 80:
        z = z*z + c
        n += 1
    return n
When you add that decorator, you're converting this function into a vectorized version. Without the decorator, it takes a scalar value, i.e. a single complex number. After conversion, mandelbrot will expect a vector of values, so that each can be processed in parallel. So, can you spot the massive misuse of the function you've just created here?
def vari(WIDTH, HEIGHT, RE_START, RE_END, IM_START, IM_END, draw):
    for x in range(0, WIDTH):
        for y in range(0, HEIGHT):
            print(x)
            # Convert pixel coordinate to complex number
            c = complex(RE_START + (x / WIDTH) * (RE_END - RE_START),
                        IM_START + (y / HEIGHT) * (IM_END - IM_START))
            # Compute the number of iterations
            m = mandelbrot(c)
            # The color depends on the number of iterations
            color = 255 - int(m * 255 / 80)
            # Plot the point
            draw.point([x, y], (color, color, color))
Your mandelbrot function is operating on scalar values inside a loop. In other words, you're using your vectorized function incorrectly, and in the worst way possible. Look at this converted code:
import numpy as np

def vari(WIDTH, HEIGHT, RE_START, RE_END, IM_START, IM_END, draw):
    complex_mat = np.empty((HEIGHT, WIDTH), dtype=np.complex128)
    for x in range(0, WIDTH):
        for y in range(0, HEIGHT):
            # Convert pixel coordinate to complex number
            c = complex(RE_START + (x / WIDTH) * (RE_END - RE_START),
                        IM_START + (y / HEIGHT) * (IM_END - IM_START))
            complex_mat[y, x] = c
    # Compute the number of iterations for every pixel in one call
    m = mandelbrot(complex_mat)
    for x in range(0, WIDTH):
        for y in range(0, HEIGHT):
            # The color depends on the number of iterations
            color = 255 - int(m[y, x] * 255 / 80)
            # Plot the point
            draw.point([x, y], (color, color, color))
We first create the "vector" to be fed into the "vectorized" function; in this case any NumPy array will do, and the function is applied element-wise, producing an output of the same shape.
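Incidentally, the coordinate grid itself doesn't need the double Python loop either; a sketch of the same pixel-to-complex mapping built with NumPy broadcasting (the function name is mine, not from the code above):

```python
import numpy as np

def make_complex_grid(width, height, re_start, re_end, im_start, im_end):
    # Real parts vary along columns, imaginary parts along rows;
    # broadcasting combines them into one (height, width) complex matrix,
    # matching c = RE_START + (x/WIDTH)*(RE_END-RE_START), etc.
    re = re_start + (np.arange(width) / width) * (re_end - re_start)
    im = im_start + (np.arange(height) / height) * (im_end - im_start)
    return re[np.newaxis, :] + 1j * im[:, np.newaxis]

grid = make_complex_grid(4, 3, -2.0, 1.0, -1.0, 1.0)
print(grid.shape)  # (3, 4)
```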
Now, you'll still see that this code is slow. Again, there is another very basic gap in understanding here that shows a lack of prior research. I suggest you benchmark this code, and that you do so before you ever come to SO asking for suggestions on how to improve the speed. You'll probably find it isn't even the mandelbrot code itself that causes the slowdown: everything else you've done is still serialized. You'll want to move your complex-number generation, your mandelbrot computation, and your point plotting onto the GPU. I'm unsure of how to do this with numba, and it is far beyond the scope of your question anyway, but this may be of use:
https://github.com/numba/numba/issues/4309
it appears you will want to use the built-in CUDA parallelization facilities instead of vectorize, so that you don't have to pass useless data to the GPU (i.e. so you can iterate directly over the pixels you need to generate values for, rather than passing the pixel indexes to CUDA).
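To make the serialization point concrete, here is a hedged CPU-only sketch of what a fully array-based escape-time computation looks like, with no per-pixel Python loop at all (plain NumPy, not CUDA; the function name is mine, and a GPU kernel version would follow the same per-element logic):

```python
import numpy as np

def mandelbrot_matrix(c, max_iter=80):
    # c: complex array of any shape; returns per-element iteration counts,
    # mirroring "while abs(z) <= 2 and n < max_iter" from the scalar version.
    z = np.zeros_like(c)
    n = np.zeros(c.shape, dtype=np.int32)
    for _ in range(max_iter):
        mask = np.abs(z) <= 2      # points still iterating
        if not mask.any():
            break                  # everything has escaped
        z[mask] = z[mask] * z[mask] + c[mask]
        n[mask] += 1
    return n
```

The masking keeps already-escaped points from being updated further, so the counts match the scalar loop exactly; only the iteration limit remains a serial loop.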
An additional reason the code is slow, beyond shuttling massive amounts of data back and forth between CPU and GPU, is the use of complex128. GPUs do not always have fast double precision; in particular, Nvidia tends to cripple double-precision performance on consumer GPUs, to the point where it can run at 1/32 the speed of single precision. This is relevant because complex128 is really two double-precision values stuck together; complex64 may provide better speed. The drawback of lower precision, which you are unlikely to hit in this experiment, is that as you zoom far into the Mandelbrot set you may see precision artifacts. There are techniques to fix this by seamlessly "wrapping" the function that calculates the set, but that is out of scope for this question.
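The storage half of that trade-off is easy to check; a small sketch of the size difference and the rounding error that complex64 introduces:

```python
import numpy as np

c128 = np.array([0.1 + 0.1j], dtype=np.complex128)
c64 = c128.astype(np.complex64)

# complex128 stores two 64-bit floats, complex64 two 32-bit floats.
print(c128.itemsize, c64.itemsize)  # 16 8

# Casting down to complex64 leaves a tiny rounding error per component
# (~7 significant decimal digits instead of ~15), which is what eventually
# shows up as artifacts in deep zooms.
print(abs(complex(c64[0]).real - 0.1))
```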
Finally, when I ran my modified code, it worked fine. In other words, I did not get
File "D:\anaconda\lib\site-packages\numba\cuda\cudadrv\nvvm.py", line 345, in get_arch_option
return 'compute_%d%d' % arch
TypeError: not enough arguments for format string
errors. If you still get this error while running my modified version, then you have some other configuration problem, which is far too broad and out of scope for this question given the lack of research. For example, it could be something as basic as "did you install CUDA?", but we have no way of knowing without a more focused question. Here is the output I generated (smaller, so that it would fit within the size requirements on SO). Note that I did not replace
@vectorize(['complex128(complex128)'], target='cuda')
with
@vectorize(['int32(complex128)'], target='cuda')
and that this is not an appropriate solution to your problem. Again, this points to some user-specific configuration error.