
OpenCL CPU Device vs GPU Device

Tags: opencl

Consider a simple example: vector addition.

If I build a program for CL_DEVICE_TYPE_GPU and build the same program for CL_DEVICE_TYPE_CPU, what is the difference between them (other than that the "CPU program" runs on the CPU and the "GPU program" runs on the GPU)?

Thanks for your help.

K0n57an71n asked Feb 09 '12 05:02



1 Answer

There are a few differences between the device types. The short answer to your vector question: use a GPU for large vectors and a CPU for smaller workloads.

1) Memory copying. A GPU needs the data it works on to be copied over to it, and the results have to be read back to the host afterwards. This goes over PCI-e, which yields about 5 GB/s for version 2.0 / 2.1. A CPU device can use buffers 'in place' in host memory (DDR3) via either the CL_MEM_ALLOC_HOST_PTR or the CL_MEM_USE_HOST_PTR flag; see the clCreateBuffer documentation. The PCI-e transfer is one of the big bottlenecks for many kernels.

2) Clock speed. CPUs currently have a big lead over GPUs in clock speed: 2 GHz at the low end for most CPUs vs. 1 GHz at the top end for most GPUs these days. This is one factor that really helps the CPU 'win' over a GPU for small workloads.

3) Concurrent 'threads'. High-end GPUs usually have more compute units than their CPU counterparts. For example, the 6970 GPU (Cayman) has 24 OpenCL compute units, each of which is divided into 16 SIMD units. Most of the top desktop CPUs have 8 cores, and server CPUs currently stop at 16 cores (CPU cores map 1:1 to compute units). A compute unit in OpenCL is a portion of the device that can do work different from what the rest of the device is doing.

4) Thread types. GPUs have a SIMD architecture, with many graphics-oriented instructions. CPUs dedicate a lot of their area to branch prediction and general computation. A CPU may have a SIMD unit and/or floating point unit in every core, but the Cayman chip I mentioned above has 1536 units, each with the GPU instruction set available to it. AMD calls them stream processors, and there are 4 in each of the SIMD units mentioned above (24x16x4 = 1536). No CPU will have that many sin(x)- or dot-product-capable units unless the manufacturer is willing to cut out some cache memory or branch prediction hardware. The SIMD layout of GPUs is probably the largest 'win' for large vector addition. That they also do other specialized functions is a big bonus.

5) Memory bandwidth. CPUs with DDR3: ~17 GB/s. High-end GPUs: >100 GB/s, and speeds of over 200 GB/s are becoming common lately. If your algorithm is not PCI-e limited (see #1), the GPU will outpace the CPU in raw memory access. The scheduling units in a GPU can hide memory latency further by running only tasks that aren't waiting on memory access. The groups of work-items they schedule this way are called wavefronts by AMD and warps by Nvidia. CPUs have a large and complicated caching system to help hide their memory access times when the program reuses data. For your vector add problem, you will likely be limited more by the PCI-e bus, since the vectors are generally used only once or twice each.

6) Power efficiency. A GPU (used properly) will usually be more electrically efficient than a CPU. Because CPUs dominate in clock speed, one of the only ways to really reduce their power consumption is to down-clock the chip, which obviously leads to longer compute times. Many of the top systems on the Green 500 list are heavily GPU accelerated; see green500.org.
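To put rough numbers on points 1 and 5 for the vector add case, here is a back-of-the-envelope sketch in Python. It is only a bandwidth model, not a benchmark: it reuses the approximate figures quoted above (5 GB/s PCI-e, 17 GB/s DDR3, 100 GB/s GPU memory), assumes float32 elements, and ignores kernel launch overhead, caches, compute time, and any overlap of transfers with computation. The function names are made up for illustration.

```python
# Bandwidth-only model of float32 vector addition c = a + b.
# The bandwidth constants are the rough figures from the answer above,
# not measurements; every name here is hypothetical.

BYTES_PER_FLOAT = 4
PCIE_GB_PER_S = 5.0       # PCI-e 2.0, host <-> GPU
CPU_MEM_GB_PER_S = 17.0   # DDR3 host memory
GPU_MEM_GB_PER_S = 100.0  # high-end GPU memory


def vector_add_bytes(n):
    """Bytes touched for n elements: read a, read b, write c."""
    return 3 * n * BYTES_PER_FLOAT


def cpu_time_s(n):
    """CPU device using host buffers in place, so no PCI-e copy."""
    return vector_add_bytes(n) / (CPU_MEM_GB_PER_S * 1e9)


def gpu_kernel_time_s(n):
    """GPU kernel alone, with a, b and c already in GPU memory."""
    return vector_add_bytes(n) / (GPU_MEM_GB_PER_S * 1e9)


def gpu_total_time_s(n):
    """GPU device: copy a and b in, run the kernel, copy c back."""
    transfer = vector_add_bytes(n) / (PCIE_GB_PER_S * 1e9)
    return transfer + gpu_kernel_time_s(n)


if __name__ == "__main__":
    for n in (10**4, 10**6, 10**8):
        print(
            f"n={n:>9}: cpu={cpu_time_s(n):.2e}s  "
            f"gpu kernel={gpu_kernel_time_s(n):.2e}s  "
            f"gpu incl. PCI-e={gpu_total_time_s(n):.2e}s"
        )
```

Under these assumptions the GPU kernel on its own is faster than the CPU, but once the PCI-e copies are counted the GPU comes out slower for a single vector add. The picture changes if the data stays on the GPU across several kernels, or if the kernel does much more arithmetic per byte than an addition.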

mfa answered Dec 23 '22 13:12