 

How to run Python script on a Discrete Graphics AMD GPU?

WHAT I WANT TO DO:

I have a script that finds all the prime numbers within a given range:

# Python program to display all the prime numbers within an interval

lower = 900
upper = 1000

print("Prime numbers between", lower, "and", upper, "are:")

for num in range(lower, upper + 1):
   # all prime numbers are greater than 1
   if num > 1:
       for i in range(2, num):
           if (num % i) == 0:
               break
       else:
           print(num)

I would like to run this script on the GPU instead of the CPU so that it is faster.
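(For reference, the inner loop above can already be cut down a lot on the CPU alone by stopping trial division at the square root; a sketch of that variant, using nothing beyond the standard library:)

```python
# Same prime listing as above, but trial division stops at sqrt(num):
# a large CPU-side speedup before any GPU is involved.
from math import isqrt

lower = 900
upper = 1000

primes = []
for num in range(lower, upper + 1):
    # num is prime if no i in [2, sqrt(num)] divides it
    if num > 1 and all(num % i for i in range(2, isqrt(num) + 1)):
        primes.append(num)

print(primes)
# [907, 911, 919, 929, 937, 941, 947, 953, 967, 971, 977, 983, 991, 997]
```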

THE PROBLEM:

I don't have an NVIDIA GPU on my Intel NUC NUC8i7HVK, only a discrete AMD GPU.


If I run this code to check which GPUs I have:

import pyopencl as cl
import numpy as np

a = np.arange(32).astype(np.float32)
res = np.empty_like(a)

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

mf = cl.mem_flags
a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
dest_buf = cl.Buffer(ctx, mf.WRITE_ONLY, res.nbytes)

prg = cl.Program(ctx, """
    __kernel void sq(__global const float *a,
    __global float *c)
    {
      int gid = get_global_id(0);
      c[gid] = a[gid] * a[gid];
    }
    """).build()

prg.sq(queue, a.shape, None, a_buf, dest_buf)

cl.enqueue_copy(queue, res, dest_buf)

print(a, res)

I receive:

  • [0] <pyopencl.Platform 'AMD Accelerated Parallel Processing' at 0x7ffb3d492fd0>
  • [1] <pyopencl.Platform 'Intel(R) OpenCL HD Graphics' at 0x187b648ed80>

THE POSSIBLE APPROACH TO THE PROBLEM:

I found a guide that takes you by the hand and explains step by step how to run code on your GPU. But all the Python libraries that push Python work onto the GPU, such as PyOpenGL, PyOpenCL, TensorFlow (Force python script on GPU), PyTorch, etc., are tailored for NVIDIA.

If you have an AMD GPU, all these libraries ask for ROCm, but as far as I know that software still doesn't support integrated or discrete GPUs like mine (see my own reply below).

I only found one guide that talks about this approach, but I cannot make it work.

Is there any hope, or am I just trying to do something impossible?

EDIT: Reply to @chapelo

If I choose 0 the reply is:

Set the environment variable PYOPENCL_CTX='0' to avoid being asked again.
[ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 11. 12. 13. 14. 15. 16. 17.
 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 31.] [  0.   1.   4.   9.  16.  25.  36.  49.  64.  81. 100. 121. 144. 169.
 196. 225. 256. 289. 324. 361. 400. 441. 484. 529. 576. 625. 676. 729.
 784. 841. 900. 961.]

If I choose 1 the reply is:

Set the environment variable PYOPENCL_CTX='1' to avoid being asked again.
[ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 11. 12. 13. 14. 15. 16. 17.
 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 31.] [  0.   1.   4.   9.  16.  25.  36.  49.  64.  81. 100. 121. 144. 169.
 196. 225. 256. 289. 324. 361. 400. 441. 484. 529. 576. 625. 676. 729.
 784. 841. 900. 961.]
Francesco Mantovani asked Jan 25 '23


2 Answers

After extensive research and several tries, I reached the following conclusions:

  1. PyOpenGL: Mainly works with NVIDIA. If you have an AMD GPU you need to install ROCm
  2. PyOpenCL: Mainly works with NVIDIA. If you have an AMD GPU you need to install ROCm
  3. TensorFlow: Mainly works with NVIDIA. If you have an AMD GPU you need to install ROCm
  4. PyTorch: Mainly works with NVIDIA. If you have an AMD GPU you need to install ROCm

I installed ROCm, but running rocminfo returns:

ROCk module is NOT loaded, possibly no GPU devices
Unable to open /dev/kfd read-write: No such file or directory
Failed to get user name to check for video group membership
hsa api call failure at: /src/rocminfo/rocminfo.cc:1142
Call returned HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.

clinfo returns:

Number of platforms                               1
  Platform Name                                   AMD Accelerated Parallel Processing
  Platform Vendor                                 Advanced Micro Devices, Inc.
  Platform Version                                OpenCL 2.0 AMD-APP (3212.0)
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_icd cl_amd_event_callback
  Platform Extensions function suffix             AMD

  Platform Name                                   AMD Accelerated Parallel Processing
Number of devices                                 0

NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  No platform
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   No platform
  clCreateContext(NULL, ...) [default]            No platform
  clCreateContext(NULL, ...) [other]              No platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  No devices found in platform

rocm-smi returns:

Segmentation fault

This is because the official guide says that "The integrated GPUs of Ryzen are not officially supported targets for ROCm.", and since mine is an integrated GPU I'm out of scope.

I will stop wasting my time and will probably buy an NVIDIA or AMD eGPU (external GPU).

Francesco Mantovani answered Jan 27 '23


pyopencl does work with both your AMD and your Intel GPUs, and you have already checked that your installation is working. Just set the environment variable PYOPENCL_CTX='0' to use the AMD GPU every time without being asked.
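A minimal way to pin that choice from inside the script (assuming platform 0 is the AMD one, as in your listing) is to set the variable before any context is created:

```python
import os

# Pick OpenCL platform 0 (the AMD one, per the listing above) before
# pyopencl builds a context, so create_some_context() stops prompting.
os.environ["PYOPENCL_CTX"] = "0"

# import pyopencl as cl
# ctx = cl.create_some_context()   # now selects platform 0 silently
```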

Or instead of using ctx = cl.create_some_context(), you could define the context in your program by using:

platforms = cl.get_platforms()
ctx = cl.Context(
   dev_type=cl.device_type.ALL,
   properties=[(cl.context_properties.PLATFORM, platforms[0])])

Don't take for granted that your AMD GPU is better than your Intel one in every case. I have had cases where the Intel surpassed the other. I think this has to do with the cost of copying the data from the CPU to the other GPU.

Having said that, I think that running your script in parallel won't help as much as using a better algorithm:

  • with a sieving algorithm, get the prime numbers up to the square root of your upper number.
  • applying a similar sieving algorithm, use the primes from the previous step to sieve numbers from your lower to upper bounds.

Perhaps this is not a good example of an algorithm that can be easily run in parallel, but you are all set up to try another example.

However, to show you how you can solve this problem using your GPU, consider the following changes:

The serial algorithm would be something like the following:

from math import sqrt

def primes_below(number):
    n = lambda a: 2 if a==0 else 2*a + 1
    limit = int(sqrt(number)) + 1
    size = number//2
    primes = [True] * size
    for i in range(1, size):
        if primes[i]:
            num = n(i)
            for j in range(i+num, size, num):
                primes[j] = False
    for i, flag in enumerate(primes):
        if flag: yield n(i)

def primes_between(lo, hi):
    primes = list(primes_below(int(sqrt(hi))+1))
    size = (hi - lo - (0 if hi%2 else 1))//2 + 1
    n = lambda a: 2*a + lo + (0 if lo%2 else 1)
    numbers = [True]*size
    for i, prime in enumerate(primes):
        if i == 0: continue
        start = 0
        while (n(start)%prime) != 0: 
            start += 1
        for j in range(start, size, prime):
            numbers[j] = False
    for i, flag in enumerate(numbers):
        if flag: yield n(i)

This prints the list of primes between 1e6 and 5e6 in 0.64 seconds.

Trying your script with my GPU, it didn't finish in more than 5 minutes. For a problem 10 times smaller, the primes between 1e5 and 5e5, it took roughly 29 seconds.

Modifying the script so that each GPU work-item divides one odd number (there's no point testing even numbers) by a list of precomputed primes up to the square root of the upper bound, stopping once a prime exceeds the square root of the number itself, completes the same task in 0.50 seconds. That's an improvement!
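The host-side candidate list (odd numbers only) can be previewed without touching the GPU; a pure-Python sketch of the same index arithmetic used below, with the example bounds 900 and 1000:

```python
lo, hi = 900, 1000

# Only odd candidates are shipped to the device: evens (other than 2)
# can never be prime, which halves the work up front.
start = lo + (0 if lo & 1 else 1)   # first odd number >= lo
stop = hi + (0 if hi & 1 else 1)    # exclusive upper bound for the odds
numbers_h = list(range(start, stop, 2))

print(numbers_h[0], numbers_h[-1], len(numbers_h))  # 901 999 50
```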

The code is the following:

import numpy as np
import pyopencl as cl
import pyopencl.algorithm
import pyopencl.array

def primes_between_using_cl(lo, hi):

    primes = list(primes_below(int(sqrt(hi))+1))

    numbers_h = np.arange(  lo + (0 if lo&1 else 1), 
                            hi + (0 if hi&1 else 1),
                            2,
                            dtype=np.int32)

    size = (hi - lo - (0 if hi%2 else 1))//2 + 1

    code = """\
    __kernel 
    void is_prime( __global const int *primes,
                   __global       int *numbers) {
      int gid = get_global_id(0);
      int num = numbers[gid];
      int max = (int) (sqrt((float)num) + 1.0);
      for (; *primes; ++primes) {
   
        if (*primes <= max && num % *primes == 0) {
          numbers[gid] = 0;
          return;
        }
      }
    }
    """

    platforms = cl.get_platforms()
    ctx = cl.Context(dev_type=cl.device_type.ALL,
       properties=[(cl.context_properties.PLATFORM, platforms[0])])     
    queue = cl.CommandQueue(ctx)
    prg = cl.Program(ctx, code).build()
    numbers_d = cl.array.to_device(queue, numbers_h)

    primes_d = cl.array.to_device(queue,
                                  np.array(primes[1:], # don't need 2
                                  dtype=np.int32))

    prg.is_prime(queue, (size, ), None, primes_d.data, numbers_d.data)

    array, length = cl.algorithm.copy_if(numbers_d, "ary[i]>0")[:2]

    yield from array.get()[:length.get()]
chapelo answered Jan 27 '23