Having stumbled over the forum thread "dot product faster on cpu than on gpu using OpenCL", I was reminded again that there are problems which look like they're made for OpenCL*, but where using OpenCL gives us no gain. E.g. I also have a k-means implementation written with pyopencl which is several times faster than a simple Python implementation, but still several times slower than the scipy function for k-means.
So how do you decide when to use OpenCL?
Or is it one of these triangle cases, where you can only (/have to) choose two of the three corners to make it work?
            problem size
                /\
               /  \
              /    \
             /      \
            /________\
       GPU/CPU      optimization
I know that I used somewhat bold wording for the title and the questions. I'll change it if I can think of a more suitable wording.
Thanks.
* simple matrix-style operations like dot products, k-means, or matrix multiplication
OpenCL is the primary language used to run general-purpose computation on AMD GPUs. Intel offers GPUs integrated into its CPUs; OpenCL can run on these, but while they are sufficient for laptops, they do not deliver competitive performance for general-purpose computations.
CUDA is more modern and stable than OpenCL and has very good backwards compatibility. Nvidia is more focused on general-purpose GPU programming, while AMD is more focused on gaming. Most GPU programming is done in CUDA; usually you won't get a choice of more than one compiler for GPU programming in any given 'language'.
For example, GPU programming has been used to accelerate video, digital image, and audio signal processing, statistical physics, scientific computing, medical imaging, computer vision, neural networks and deep learning, cryptography, and even intrusion detection, among many other areas.
The real key should be whether your algorithm has a lot of inherent parallelism, so that you can hand over a data set and have a significant amount of parallel processing happen on it. Remember that a GPU may have many, many cores, but each of them is only clocked at roughly 0.5-1 GHz. The strength lies in processing a large number of parallel operations to achieve extremely high throughput.
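To make the "inherent parallelism" point concrete, here is a minimal pyopencl sketch (the kernel name, array size, and the operation itself are arbitrary illustrations, not taken from the question's code). An element-wise operation maps one work-item to each array element, which is exactly the shape of workload where many slow cores beat a few fast ones:

    import numpy as np
    import pyopencl as cl

    # One work-item per element: every output value is independent,
    # so all of the GPU's cores can be kept busy at once.
    KERNEL = """
    __kernel void scaled_add(__global const float *a,
                             __global const float *b,
                             __global float *out)
    {
        int gid = get_global_id(0);
        out[gid] = 2.0f * a[gid] + b[gid];
    }
    """

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)
    prg = cl.Program(ctx, KERNEL).build()

    n = 1_000_000
    a = np.random.rand(n).astype(np.float32)
    b = np.random.rand(n).astype(np.float32)
    out = np.empty_like(a)

    mf = cl.mem_flags
    a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
    b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
    out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, out.nbytes)

    # Launch one work-item per element, then copy the result back.
    prg.scaled_add(queue, a.shape, None, a_buf, b_buf, out_buf)
    cl.enqueue_copy(queue, out, out_buf)

A dot product, by contrast, ends in a reduction step and moves two floats across the bus for every multiply-add it performs, which is one reason it can lose to the CPU despite looking "embarrassingly parallel" on paper.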
Consider throughput roughly as (data computed * frequency * pipeline stages): there is a tradeoff in accepting, say, 1/6th the frequency on one of those GPU cores, in exchange for probably more than 6x the number of cores (pipeline stages).
Of course there's the additional overhead of crossing the CPU <-> GPU boundary, and your algorithm could also take multiple GPU clock cycles per operation to compute.
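As a back-of-envelope illustration of that tradeoff (all rates below are invented round numbers, not measurements of any real hardware):

    # Toy break-even model: the GPU wins only when its extra parallel
    # throughput outweighs the CPU <-> GPU transfer overhead.

    def cpu_time(n_ops, ops_per_sec=3e9):
        return n_ops / ops_per_sec

    def gpu_time(n_ops, n_bytes, ops_per_sec=500e9, bytes_per_sec=8e9):
        # The transfer cost is paid no matter how fast the kernel is.
        return n_bytes / bytes_per_sec + n_ops / ops_per_sec

    # Dot product of two float32 vectors (n = 1e6): only ~2 ops per
    # element moved, so the transfer dominates and the CPU is faster.
    n = 1_000_000
    print(cpu_time(2 * n), gpu_time(2 * n, 2 * n * 4))

    # Multiplying two 2000x2000 matrices: ~2*2000^3 ops on only
    # ~3*2000^2 floats of bus traffic, so the GPU wins easily.
    m = 2000
    print(cpu_time(2 * m**3), gpu_time(2 * m**3, 3 * m**2 * 4))

Under these made-up numbers, the dot product loses on the GPU because the transfer cost dominates, while the matrix multiplication wins because it performs many operations per byte moved. That is consistent with the behavior described in the question.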