WHAT I WANT TO DO:
I have a script that finds all the prime numbers within a given range:
# Python program to display all the prime numbers within an interval
lower = 900
upper = 1000

print("Prime numbers between", lower, "and", upper, "are:")

for num in range(lower, upper + 1):
    # all prime numbers are greater than 1
    if num > 1:
        for i in range(2, num):
            if (num % i) == 0:
                break
        else:
            print(num)
I would like to run this script on the GPU instead of the CPU, so that it is faster.
THE PROBLEM:
I don't have an NVIDIA GPU on my Intel NUC NUC8i7HVK, but a "Discrete GPU".
If I run this code to check what my GPUs are:
import pyopencl as cl
import numpy as np

a = np.arange(32).astype(np.float32)
res = np.empty_like(a)

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags

a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
dest_buf = cl.Buffer(ctx, mf.WRITE_ONLY, res.nbytes)

prg = cl.Program(ctx, """
__kernel void sq(__global const float *a,
                 __global float *c)
{
    int gid = get_global_id(0);
    c[gid] = a[gid] * a[gid];
}
""").build()

prg.sq(queue, a.shape, None, a_buf, dest_buf)
cl.enqueue_copy(queue, res, dest_buf)
print(a, res)
I receive:
[0] <pyopencl.Platform 'AMD Accelerated Parallel Processing' at 0x7ffb3d492fd0>
[1] <pyopencl.Platform 'Intel(R) OpenCL HD Graphics' at 0x187b648ed80>
THE POSSIBLE APPROACH TO THE PROBLEM:
I found a guide that takes you by the hand and explains step by step how to run it on your GPU. But all the Python libraries that pipe Python through the GPU, like PyOpenGL, PyOpenCL, TensorFlow (Force python script on GPU), PyTorch, etc., are tailored for NVIDIA.
If you have an AMD card, all the libraries ask for ROCm, but as far as I know that software still doesn't support integrated or discrete GPUs like mine (see my own reply below).
I only found a guide that talks about such an approach, but I cannot make it work.
Is there hope, or am I just trying to do something impossible?
EDIT: Reply to @chapelo
If I choose 0, the reply is:
Set the environment variable PYOPENCL_CTX='0' to avoid being asked again.
[ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17.
18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 31.] [ 0. 1. 4. 9. 16. 25. 36. 49. 64. 81. 100. 121. 144. 169.
196. 225. 256. 289. 324. 361. 400. 441. 484. 529. 576. 625. 676. 729.
784. 841. 900. 961.]
If I choose 1, the reply is:
Set the environment variable PYOPENCL_CTX='1' to avoid being asked again.
[ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17.
18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 31.] [ 0. 1. 4. 9. 16. 25. 36. 49. 64. 81. 100. 121. 144. 169.
196. 225. 256. 289. 324. 361. 400. 441. 484. 529. 576. 625. 676. 729.
784. 841. 900. 961.]
After extensive research and several tries, I reached the following conclusion:
I installed ROCm, but running rocminfo returns:
ROCk module is NOT loaded, possibly no GPU devices
Unable to open /dev/kfd read-write: No such file or directory
Failed to get user name to check for video group membership
hsa api call failure at: /src/rocminfo/rocminfo.cc:1142
Call returned HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.
clinfo returns:
Number of platforms 1
Platform Name AMD Accelerated Parallel Processing
Platform Vendor Advanced Micro Devices, Inc.
Platform Version OpenCL 2.0 AMD-APP (3212.0)
Platform Profile FULL_PROFILE
Platform Extensions cl_khr_icd cl_amd_event_callback
Platform Extensions function suffix AMD
Platform Name AMD Accelerated Parallel Processing
Number of devices 0
NULL platform behavior
clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...) No platform
clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...) No platform
clCreateContext(NULL, ...) [default] No platform
clCreateContext(NULL, ...) [other] No platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL) No devices found in platform
rocm-smi returns:
Segmentation fault
This is because the official guide says that "The integrated GPUs of Ryzen are not officially supported targets for ROCm," and since mine is an integrated GPU I'm out of scope.
I will stop wasting my time and will probably buy an NVIDIA or AMD eGPU (external GPU).
pyopencl does work with both your AMD and your Intel GPUs, and you have checked that your installation is working. Just set the environment variable PYOPENCL_CTX='0' to use the AMD one every time without being asked.
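For instance, here is a minimal sketch of selecting the platform from inside the script itself (assuming, as in your output above, that the AMD platform sits at index 0; the pyopencl calls are left commented out since they only matter once the variable is in place):

```python
import os

# PYOPENCL_CTX must be set before create_some_context() runs;
# "0" picks the first platform listed (the AMD one in the output above).
os.environ["PYOPENCL_CTX"] = "0"

# import pyopencl as cl
# ctx = cl.create_some_context()  # now selects platform 0 without prompting

print(os.environ["PYOPENCL_CTX"])  # prints: 0
```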
Or, instead of using ctx = cl.create_some_context(), you could define the context in your program:

platforms = cl.get_platforms()
ctx = cl.Context(
    dev_type=cl.device_type.ALL,
    properties=[(cl.context_properties.PLATFORM, platforms[0])])
Don't take for granted that your AMD is better than your Intel in every case. I have had cases where the Intel surpassed the other one; I think this has to do with the cost of copying the data from the CPU over to the other GPU.
Having said that, I think that running your script in parallel won't be much of an improvement compared to using a better algorithm.
Perhaps this is not a good example of an algorithm that can easily be run in parallel, but you are now all set up to try other examples.
However, to show you how you can solve this problem using your GPU, consider the following changes.
The serial algorithm would be something like the following:
from math import sqrt

def primes_below(number):
    # only odd numbers are stored; index i maps to the number n(i)
    n = lambda a: 2 if a == 0 else 2*a + 1
    limit = int(sqrt(number)) + 1
    size = number // 2
    primes = [True] * size
    for i in range(1, size):
        if primes[i]:
            num = n(i)
            for j in range(i + num, size, num):
                primes[j] = False
    for i, flag in enumerate(primes):
        if flag:
            yield n(i)

def primes_between(lo, hi):
    primes = list(primes_below(int(sqrt(hi)) + 1))
    size = (hi - lo - (0 if hi % 2 else 1)) // 2 + 1
    n = lambda a: 2*a + lo + (0 if lo % 2 else 1)
    numbers = [True] * size
    for i, prime in enumerate(primes):
        if i == 0:
            continue  # skip 2: only odd numbers are kept in the list
        start = 0
        while (n(start) % prime) != 0:
            start += 1
        for j in range(start, size, prime):
            numbers[j] = False
    for i, flag in enumerate(numbers):
        if flag:
            yield n(i)
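To sanity-check the sieve above on a small range, you can compare its output against plain trial division. This reference checker is my own addition, not part of the original script:

```python
from math import isqrt

def trial_primes(lo, hi):
    # Naive reference: keep n if no divisor in 2..sqrt(n) divides it.
    return [n for n in range(max(lo, 2), hi + 1)
            if all(n % d for d in range(2, isqrt(n) + 1))]

# The same range as the original script: primes between 900 and 1000.
result = trial_primes(900, 1000)
print(result[0], result[-1])  # 907 997
print(len(result))            # 14 primes in this interval
```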
This prints the list of primes between 1e6 and 5e6 in 0.64 seconds.
Trying your script with my GPU, it didn't finish in over 5 minutes. For a problem 10 times smaller, the primes between 1e5 and 5e5, it took roughly 29 seconds.
Modifying the script so that each GPU work-item checks one odd number (there's no point testing even numbers) by dividing it by a list of pre-computed primes up to the square root of the upper bound, stopping as soon as a prime exceeds the square root of the number itself, it completes the same task in 0.50 seconds. That's an improvement!
The code is the following:
import numpy as np
import pyopencl as cl
import pyopencl.algorithm
import pyopencl.array

def primes_between_using_cl(lo, hi):
    # uses primes_below() and sqrt from the serial version above
    primes = list(primes_below(int(sqrt(hi)) + 1))
    numbers_h = np.arange(lo + (0 if lo & 1 else 1),
                          hi + (0 if hi & 1 else 1),
                          2,
                          dtype=np.int32)
    size = (hi - lo - (0 if hi % 2 else 1)) // 2 + 1
    code = """\
    __kernel
    void is_prime( __global const int *primes,
                   __global       int *numbers) {
        int gid = get_global_id(0);
        int num = numbers[gid];
        int limit = (int) (sqrt((float)num) + 1.0f);
        for (; *primes; ++primes) {      /* the primes list is 0-terminated */
            if (*primes > limit)
                break;                   /* no divisor up to sqrt(num): prime */
            if (num % *primes == 0) {
                numbers[gid] = 0;        /* composite: zero it out */
                return;
            }
        }
    }
    """
    platforms = cl.get_platforms()
    ctx = cl.Context(dev_type=cl.device_type.ALL,
                     properties=[(cl.context_properties.PLATFORM, platforms[0])])
    queue = cl.CommandQueue(ctx)
    prg = cl.Program(ctx, code).build()
    numbers_d = cl.array.to_device(queue, numbers_h)
    primes_d = cl.array.to_device(queue,
                                  np.array(primes[1:] + [0],  # don't need 2; 0 terminates the list
                                           dtype=np.int32))
    prg.is_prime(queue, (size, ), None, primes_d.data, numbers_d.data)
    array, length = cl.algorithm.copy_if(numbers_d, "ary[i]>0")[:2]
    yield from array.get()[:length.get()]
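The kernel's per-work-item logic can be mirrored in plain Python, which is handy for checking the trial-division idea without a GPU at hand. This is a sketch; mark_if_composite is a hypothetical helper of mine, not part of pyopencl:

```python
from math import isqrt

def mark_if_composite(num, odd_primes):
    # Mirror of one is_prime work-item: return 0 when some pre-computed
    # prime <= sqrt(num) divides num, otherwise leave num untouched.
    limit = isqrt(num) + 1
    for p in odd_primes:
        if p > limit:
            break  # same early exit as the kernel
        if num % p == 0:
            return 0
    return num

# Odd primes below sqrt(1000) ~ 31.6 (2 dropped, as in primes[1:] above)
odd_primes = [3, 5, 7, 11, 13, 17, 19, 23, 29, 31]
print(mark_if_composite(997, odd_primes))  # 997 survives: it is prime
print(mark_if_composite(999, odd_primes))  # 0: 999 is divisible by 3
```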