Having stumbled over the forum thread "dot product faster on cpu than on gpu using OpenCL", I was reminded again that there are problems which look like they're made for OpenCL*, but where using OpenCL gives us no gain. E.g. I also have a k-means implementation written with pyopencl which is several times faster than a simple Python implementation, but still several times slower than the scipy function for k-means.
So how do you decide when to use OpenCL?
Or is it one of these triangle cases, where you can only (/have to) choose two of the three corners to make it work?
            problem size
                /\
               /  \
              /    \
             /      \
            /________\
       GPU/CPU      optimization
I know that I used somewhat bold wording for the title and the questions. I'll change it if I can think of a more suitable wording.
Thanks.
* simple matrix-style operations like dot products, k-means, or matrix multiplication
OpenCL is the primary language used to run general-purpose computation on AMD GPUs. Intel offers GPUs integrated into its CPUs; OpenCL can run on these, but while they are sufficient for laptops, they do not deliver competitive performance for general-purpose computations.
CUDA is more modern and stable than OpenCL and has very good backwards compatibility. Nvidia is more focused on general-purpose GPU programming, while AMD is more focused on gaming. Most GPU programming is done in CUDA; usually you won't get a choice of more than one compiler for GPU programming in any given 'language'.
For example, GPU programming has been used to accelerate video, digital image, and audio signal processing, statistical physics, scientific computing, medical imaging, computer vision, neural networks and deep learning, cryptography, and even intrusion detection, among many other areas.
The real key should be whether your algorithm has a lot of inherent parallelism, so that you can hand over a data set and have a significant amount of parallel processing happen on it. Remember that a GPU may have many, many cores, but each of them is only clocked at roughly 0.5-1 GHz. The strength lies in processing a large number of parallel operations to achieve extremely high throughput.
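To make the "inherent parallelism" point concrete, here is a minimal pyopencl sketch (the kernel name, array size, and the operation itself are arbitrary illustrations, not taken from the question's code). An element-wise operation maps one work-item to each array element, which is exactly the shape of workload where many slow cores beat a few fast ones:

    import numpy as np
    import pyopencl as cl

    # One work-item per element: every output value is independent,
    # so all of the GPU's cores can be kept busy at once.
    KERNEL = """
    __kernel void scaled_add(__global const float *a,
                             __global const float *b,
                             __global float *out)
    {
        int gid = get_global_id(0);
        out[gid] = 2.0f * a[gid] + b[gid];
    }
    """

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)
    prg = cl.Program(ctx, KERNEL).build()

    n = 1_000_000
    a = np.random.rand(n).astype(np.float32)
    b = np.random.rand(n).astype(np.float32)
    out = np.empty_like(a)

    mf = cl.mem_flags
    a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
    b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
    out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, out.nbytes)

    # Launch one work-item per element, then copy the result back.
    prg.scaled_add(queue, a.shape, None, a_buf, b_buf, out_buf)
    cl.enqueue_copy(queue, out, out_buf)

A dot product, by contrast, ends in a reduction step and moves two floats across the bus for every multiply-add it performs, which is one reason it can lose to the CPU despite looking "embarrassingly parallel" on paper.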
Consider throughput roughly as (data computed * frequency * pipeline stages): there is a tradeoff in accepting, say, 1/6th the frequency on one of those GPU cores, in exchange for probably more than 6x the number of cores (pipeline stages).
Of course there's the additional overhead of crossing the CPU <-> GPU boundary, and your algorithm could also take multiple GPU clock cycles per operation to compute.
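As a back-of-envelope illustration of that tradeoff (all rates below are invented round numbers, not measurements of any real hardware):

    # Toy break-even model: the GPU wins only when its extra parallel
    # throughput outweighs the CPU <-> GPU transfer overhead.

    def cpu_time(n_ops, ops_per_sec=3e9):
        return n_ops / ops_per_sec

    def gpu_time(n_ops, n_bytes, ops_per_sec=500e9, bytes_per_sec=8e9):
        # The transfer cost is paid no matter how fast the kernel is.
        return n_bytes / bytes_per_sec + n_ops / ops_per_sec

    # Dot product of two float32 vectors (n = 1e6): only ~2 ops per
    # element moved, so the transfer dominates and the CPU is faster.
    n = 1_000_000
    print(cpu_time(2 * n), gpu_time(2 * n, 2 * n * 4))

    # Multiplying two 2000x2000 matrices: ~2*2000^3 ops on only
    # ~3*2000^2 floats of bus traffic, so the GPU wins easily.
    m = 2000
    print(cpu_time(2 * m**3), gpu_time(2 * m**3, 3 * m**2 * 4))

Under these made-up numbers, the dot product loses on the GPU because the transfer cost dominates, while the matrix multiplication wins because it performs many operations per byte moved. That is consistent with the behavior described in the question.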