We are currently trying to optimize a system with at least 12 variables. The total number of combinations of these variables is over 1 billion. This is not deep learning, machine learning, or TensorFlow, but arbitrary calculation on time series data.
We have implemented our code in Python and run it successfully on the CPU. We also tried multiprocessing, which works well, but we need faster computation because the calculation takes weeks. We have a GPU system consisting of 6 AMD GPUs. We would like to run our code on this GPU system but do not know how to do so.
My questions are:
We read that we need to adjust the code for GPU computation but we do not know how to do that.
PS: I can add more information if you need. I tried to keep the post as simple as possible to avoid conflict.
PyTorch can be installed as a Python package on AMD GPUs with ROCm 4.0 and above.
Running a Python script on the GPU can be considerably faster than on the CPU. Note, however, that to process a data set on the GPU, the data must first be transferred to the GPU's memory, which takes additional time. If the data set is small, the CPU may therefore perform better than the GPU.
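To make the transfer cost concrete, here is a minimal sketch using PyTorch, assuming a ROCm build (which exposes AMD GPUs through the usual `torch.cuda` API). The function `count_above` and its threshold rule are hypothetical stand-ins for your own calculation. If PyTorch is not installed, the sketch falls back to plain NumPy.

```python
import numpy as np

try:
    import torch  # ROCm builds of PyTorch report AMD GPUs via torch.cuda
except ImportError:
    torch = None


def count_above(series, thresholds):
    """For each threshold, count how many values of `series` exceed it."""
    if torch is None:
        # CPU fallback: same broadcasting scheme in NumPy
        return (series[None, :] > thresholds[:, None]).sum(axis=1)

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    s = torch.as_tensor(series, device=device)      # host -> device copy
    t = torch.as_tensor(thresholds, device=device)  # host -> device copy
    counts = (s[None, :] > t[:, None]).sum(dim=1)   # computed on the device
    return counts.cpu().numpy()                     # device -> host copy
```

The two copies in and the one copy out are the overhead mentioned above. They only pay off when the work done between them is large, for example many thousands of thresholds against a long series.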
There are at least two options to speed up calculations using the GPU: PyOpenCL and Numba.
But I usually don't recommend running code on the GPU from the start. Calculations on the GPU are not always faster; it depends on how complex they are and how good your CPU and GPU implementations are. If you follow the steps below, you can get a good idea of what to expect.
If your code is pure Python (lists, floats, for-loops etc.) you can see a huge speed-up (maybe up to 100x) by using vectorized NumPy code. This is also an important step in finding out how your GPU code could be implemented, as the calculations in vectorized NumPy will follow a similar scheme. The GPU performs best on many small tasks that can run in parallel.
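As an illustration of that first step, here is a minimal sketch of the same calculation written as a pure Python loop and as vectorized NumPy. The calculation itself (mean absolute change of a price series) is a hypothetical stand-in, since the original code was not posted.

```python
import numpy as np


def mean_abs_change_loop(prices):
    """Pure Python: one interpreter step per element."""
    total = 0.0
    for i in range(1, len(prices)):
        total += abs(prices[i] - prices[i - 1])
    return total / (len(prices) - 1)


def mean_abs_change_numpy(prices):
    """Vectorized: the loop runs inside NumPy's compiled code."""
    prices = np.asarray(prices, dtype=np.float64)
    return np.abs(np.diff(prices)).mean()
```

Both functions return the same value. The vectorized form also shows the structure a GPU kernel would have: the same small operation applied independently to every element.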
Once you have a well-optimized NumPy version, you can get a first look at the GPU speed-up by using Numba. For simple cases you can just decorate your NumPy functions to run on the GPU. You can expect a speed-up of 100x to 500x compared to NumPy code, if your problem can be parallelized / vectorized.
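A minimal sketch of the decorator approach, under some assumptions: Numba's built-in GPU target is CUDA, so the GPU path here only applies to NVIDIA hardware. On a machine without CUDA the sketch uses Numba's multi-threaded CPU target instead, and without Numba it falls back to `np.vectorize`. The rule in `_signal` is a hypothetical per-element calculation.

```python
import math
import numpy as np


def _signal(price, threshold):
    # Hypothetical per-element rule: log of the price if it exceeds the threshold
    return math.log(price) if price > threshold else 0.0


try:
    from numba import vectorize, cuda

    # Pick the GPU target only if a CUDA device is actually usable
    _target = "cuda" if cuda.is_available() else "parallel"
    signal = vectorize(["float64(float64, float64)"], target=_target)(_signal)
except ImportError:
    # Numba not installed: slow but equivalent element-wise fallback
    signal = np.vectorize(_signal, otypes=[np.float64])
```

The scalar function stays unchanged in all three cases; only the target it is compiled for differs. Call it with arrays of equal shape, for example `signal(prices, np.full_like(prices, 2.0))`.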
You may have gotten this far without writing any OpenCL C code for the GPU and still have your code running on it. But if your problem is too complex, you will have to write custom code and run it using PyOpenCL. The expected speed-up is also 100x to 500x compared to good NumPy code.
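For reference, a custom kernel with PyOpenCL could look roughly like the sketch below. The kernel `scaled_diff` is a hypothetical example, not the poster's calculation. The sketch assumes an OpenCL runtime is installed for the GPU; if PyOpenCL or a device is missing, it does the same arithmetic in NumPy.

```python
import numpy as np

KERNEL_SRC = """
__kernel void scaled_diff(__global const float *a,
                          __global const float *b,
                          const float scale,
                          __global float *out)
{
    int i = get_global_id(0);          /* one work-item per element */
    out[i] = scale * (a[i] - b[i]);
}
"""


def scaled_diff(a, b, scale):
    a = np.ascontiguousarray(a, dtype=np.float32)
    b = np.ascontiguousarray(b, dtype=np.float32)
    try:
        import pyopencl as cl
        ctx = cl.create_some_context(interactive=False)
    except Exception:
        # No PyOpenCL or no usable OpenCL device: same arithmetic on the CPU
        return np.float32(scale) * (a - b)

    queue = cl.CommandQueue(ctx)
    mf = cl.mem_flags
    # Copy the inputs to device memory and allocate the output buffer
    a_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
    b_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
    out_g = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)

    prg = cl.Program(ctx, KERNEL_SRC).build()
    prg.scaled_diff(queue, a.shape, None, a_g, b_g, np.float32(scale), out_g)

    out = np.empty_like(a)
    cl.enqueue_copy(queue, out, out_g)  # blocking copy back to the host
    return out
```

With several GPUs, one context and queue per device and a split of the parameter combinations across them would be the natural next step, but that is beyond this sketch.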
The important thing to remember is that the GPU is only powerful if you use it correctly, and only for a certain set of problems.
If you have a small example of your code feel free to post it.
Another thing to say is that CUDA is often easier to use than OpenCL. There are more libraries, more examples, more documentation, and more support. Nvidia did a very good job of not supporting OpenCL well from the very start. I usually prefer open standards, but we moved to CUDA and Nvidia hardware quickly when things became business and commercial.