OpenCL CPU Device vs GPU Device

Tags:

opencl

Consider a simple example: vector addition.

If I build a program for CL_DEVICE_TYPE_GPU and build the same program for CL_DEVICE_TYPE_CPU, what is the difference between them (other than the "CPU program" running on the CPU and the "GPU program" running on the GPU)?

Thanks for your help.

558

asked Feb 09 '12 05:02

K0n57an71n

1 Answer

There are a few differences between the device types. The simple answer to your vector question is: use a GPU for large vectors and a CPU for smaller workloads.

1) Memory copying. A GPU needs the data you are working on to be copied to it, and the results are later read back to the host. This is done over PCI-e, which yields about 5 GB/s for version 2.0 / 2.1, and it is one of the big bottlenecks for many kernels. CPUs can use buffers 'in place' (in DDR3) using either of the CL_MEM_ALLOC_HOST_PTR or CL_MEM_USE_HOST_PTR flags; see the clCreateBuffer documentation.
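
As a rough illustration of the copy cost, here is a minimal Python sketch that estimates how long the vector-add data spends on the bus. The 5 GB/s figure is the one quoted above; the function name, the choice of 32-bit floats, and the sample sizes are assumptions made for illustration, and the estimate ignores per-transfer latency and driver overhead.

```python
# Back-of-envelope estimate only. The 5 GB/s figure is the one quoted above;
# the function name and the float32 element size are illustrative assumptions.

PCIE_BYTES_PER_SEC = 5e9   # ~5 GB/s for PCI-e 2.0 / 2.1
BYTES_PER_FLOAT = 4        # assuming 32-bit floats


def pcie_transfer_seconds(n_elements, bandwidth=PCIE_BYTES_PER_SEC):
    """Estimate bus time for c = a + b with n_elements per vector.

    Two input vectors are copied to the device and one result is read
    back, so 3 * n_elements floats cross PCI-e in total.
    """
    total_bytes = 3 * n_elements * BYTES_PER_FLOAT
    return total_bytes / bandwidth


if __name__ == "__main__":
    for n in (1_000, 1_000_000, 100_000_000):
        ms = pcie_transfer_seconds(n) * 1e3
        print(f"{n:>11,d} elements: ~{ms:.4f} ms on the bus")
```

Whatever the kernel itself costs, this bus time is added on top for a discrete GPU, whereas a CPU device using a host-pointer buffer can skip the copy.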

2) Clock speed. CPUs currently have a big lead over GPUs in clock speed: 2 GHz on the low end for most CPUs, versus 1 GHz as the top end for most GPUs these days. This is one factor that really helps the CPU 'win' over a GPU for small workloads.

3) Concurrent 'threads'. High-end GPUs usually have more compute units than their CPU counterparts. For example, the 6970 GPU (Cayman) has 24 OpenCL compute units, each of which is divided into 16 SIMD units. Most of the top desktop CPUs have 8 cores, and server CPUs currently stop at 16 cores (CPU cores map 1:1 to compute units). A compute unit in OpenCL is a portion of the device which can do work that is different from the rest of the device.

4) Thread types. GPUs have a SIMD architecture, with many graphics-oriented instructions. CPUs have a lot of their area dedicated to branch prediction and general computations. A CPU may have a SIMD unit and/or floating point unit in every core, but the Cayman chip I mentioned above has 1536 units with the GPU instruction set available to each one. AMD calls them stream processors, and there are 4 in each of the SIMD units mentioned above (24x16x4 = 1536). No CPU will have that many sin(x) or dot-product-capable units unless the manufacturer wants to cut out some cache memory or branch prediction hardware. The SIMD layout of GPUs is probably the largest 'win' for large vector addition situations. That they also do other specialized functions is a big bonus.
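
To put points 2 to 4 together, the sketch below multiplies unit counts by clock speed to get a very crude "lane-operations per second" figure. The GPU counts and both clock speeds are the ones given above; the 4 SIMD lanes per CPU core is an assumption (a 128-bit SIMD unit working on 32-bit floats), and all names are made up for illustration. It ignores memory, instruction mix, and how well the units are actually kept busy, so treat it as a ceiling and not a benchmark.

```python
# Crude peak-throughput comparison; not a benchmark.
# GPU unit counts and clock figures come from the answer above.
# CPU_SIMD_LANES_PER_CORE is an ASSUMPTION (128-bit SIMD on float32).

GPU_COMPUTE_UNITS = 24
GPU_SIMD_UNITS_PER_CU = 16
GPU_STREAM_PROCESSORS_PER_SIMD = 4
GPU_CLOCK_HZ = 1_000_000_000      # "1 GHz as the top end"

CPU_CORES = 8
CPU_SIMD_LANES_PER_CORE = 4       # assumed, see note above
CPU_CLOCK_HZ = 2_000_000_000      # "2 GHz on the low end"


def gpu_stream_processors():
    """Total stream processors: 24 x 16 x 4."""
    return (GPU_COMPUTE_UNITS
            * GPU_SIMD_UNITS_PER_CU
            * GPU_STREAM_PROCESSORS_PER_SIMD)


def peak_lane_ops_per_sec(lanes, clock_hz):
    """One operation per lane per clock: an optimistic upper bound."""
    return lanes * clock_hz


def gpu_to_cpu_peak_ratio():
    gpu = peak_lane_ops_per_sec(gpu_stream_processors(), GPU_CLOCK_HZ)
    cpu = peak_lane_ops_per_sec(CPU_CORES * CPU_SIMD_LANES_PER_CORE,
                                CPU_CLOCK_HZ)
    return gpu / cpu


if __name__ == "__main__":
    print("GPU stream processors:", gpu_stream_processors())
    print("GPU/CPU peak ratio under these assumptions:",
          gpu_to_cpu_peak_ratio())
```

Even with the CPU running at twice the clock, the sheer number of lanes is what gives the GPU its headroom on large, uniform workloads.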

5) Memory bandwidth. CPUs with DDR3: ~17 GB/s. High-end GPUs: >100 GB/s, and speeds of over 200 GB/s are becoming common lately. If your algorithm is not PCI-e limited (see #1), the GPU will outpace the CPU in raw memory access. The scheduling units in a GPU can hide memory latency further by running only the groups of threads that aren't waiting on memory access; AMD calls such a group a wavefront, Nvidia calls it a warp. CPUs have a large and complicated caching system to help hide their memory access times when the program is reusing data. For your vector add problem, you will likely be limited more by the PCI-e bus, since the vectors are generally used only once or twice each.
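
The bandwidth figures above can be lined up for a single vector add. This is a minimal sketch using only the approximate numbers quoted in this answer; the dictionary keys, function names, and the assumption of 32-bit floats are illustrative, and real transfers never reach these peak rates exactly.

```python
# Approximate bandwidths quoted in the answer, in GB/s.
# Key names and helper names are illustrative, not from any library.
BANDWIDTH_GB_PER_SEC = {
    "cpu_ddr3": 17.0,     # CPU reading/writing host DDR3
    "gpu_memory": 100.0,  # high-end GPU on-board memory
    "pcie_2": 5.0,        # host <-> GPU copy over PCI-e 2.0 / 2.1
}


def stream_ms(n_elements, gb_per_sec, bytes_per_element=4, vectors=3):
    """Milliseconds to move `vectors` arrays of n_elements once.

    vectors=3 models c = a + b: two reads and one write.
    bytes_per_element=4 assumes 32-bit floats.
    """
    total_bytes = vectors * n_elements * bytes_per_element
    return total_bytes / (gb_per_sec * 1e9) * 1e3


def vector_add_estimates(n_elements):
    """Per-path time estimate, in ms, for one vector add."""
    return {name: stream_ms(n_elements, bw)
            for name, bw in BANDWIDTH_GB_PER_SEC.items()}


if __name__ == "__main__":
    for name, ms in vector_add_estimates(10_000_000).items():
        print(f"{name:>10}: ~{ms:.2f} ms")
```

Under these figures the PCI-e term is the largest of the three, which is the point made above: the GPU's memory bandwidth only pays off once the data is already on the device or is reused there.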

6) Power efficiency. A GPU (used properly) will usually be more electrically efficient than a CPU. Because CPUs dominate in clock speed, one of the only ways to really reduce their power consumption is to down-clock the chip, which obviously leads to longer compute times. Many of the top systems on the Green 500 list are heavily GPU accelerated; see green500.org.

131

answered Dec 23 '22 13:12

mfa
