I am learning OpenCL programming and am noticing something weird.
Namely, when I list all OpenCL-enabled devices on my machine (a MacBook Pro), I get the following list:
The first is my CPU, the second is the onboard graphics solution by Intel and the third is my dedicated graphics card.
Research shows that Intel has made its hardware OpenCL-compatible, so I can tap into the power of the onboard graphics unit (the Iris Pro).
With that in mind, what is the purpose of the CPU being OpenCL-compatible? Is it merely for convenience, so that kernels can fall back to the CPU if no other devices are found? Or is there some speed advantage to running code as OpenCL kernels rather than as regular, well-threaded C programs on the CPU?
See https://software.intel.com/sites/default/files/m/d/4/1/d/8/Writing_Optimal_OpenCL_28tm_29_Code_with_Intel_28R_29_OpenCL_SDK.pdf for basic info.
Basically, the Intel OpenCL compiler performs horizontal auto-vectorization for certain types of kernels. That means that with SSE4 you get 8 threads running in parallel on a single core, in a similar fashion to how an Nvidia GPU runs 32 threads on a single 32-wide SIMD unit.
There are two major benefits to this approach. First, what happens if in two years the SSE vector width doubles to 16? Then you instantly get auto-vectorization across 16 threads when you run on that CPU, with no need to rewrite your code. Second, it is far easier to write an OpenCL kernel that auto-vectorizes well than to write the equivalent in assembly or C and coax your compiler into producing efficient code.
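As a sketch of the kind of kernel this works on (the kernel name and signature are my own, illustrative choices, not from the question), a straight-line, data-parallel body is what the compiler can vectorize horizontally: each work-item becomes one SIMD lane, so the same source fills whatever vector width the hardware offers.

```c
/* Hypothetical SAXPY kernel in OpenCL C (device code; needs an OpenCL
 * host program and runtime to execute). Each work-item computes one
 * element; a CPU implementation can pack adjacent work-items into
 * SSE/AVX lanes because the body is branch-free and uniform. */
__kernel void saxpy(__global const float *x,
                    __global float *y,
                    const float a)
{
    size_t i = get_global_id(0);   /* this work-item's global index */
    y[i] = a * x[i] + y[i];        /* identical work per lane */
}
```

Kernels with divergent branches or irregular memory access defeat this packing, which is why the vendor guides stress keeping the per-work-item code uniform.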
As OpenCL implementations mature, it's possible to achieve good levels of performance portability for your kernels across a wide range of devices. Some recent work in my research group shows that, in some cases, OpenCL codes achieve a similar fraction of hardware peak performance on the CPU and the GPU. On the CPU, the OpenCL kernels were very effectively auto-vectorized by Intel's OpenCL CPU implementation. On the GPU, efficient code was generated for HPC and desktop devices from Nvidia (whose OpenCL still works surprisingly well) and AMD.
If you are developing OpenCL code to exploit the GPU anyway, then you often get a fast multi-core + SIMD CPU version "for free" by running the same code on the CPU.
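To make that concrete, here is a minimal host-side sketch using the standard OpenCL 1.x C API (error handling omitted; it assumes one platform exposing a CPU device, as on the questioner's machine): retargeting the same kernel source from GPU to CPU is a matter of changing a single device-type flag.

```c
#include <stdio.h>
#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif

int main(void)
{
    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, NULL);

    /* Swap CL_DEVICE_TYPE_CPU for CL_DEVICE_TYPE_GPU to target the
     * graphics card instead; the kernel source itself is unchanged. */
    cl_device_id device;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &device, NULL);

    char name[256];
    clGetDeviceInfo(device, CL_DEVICE_NAME, sizeof name, name, NULL);
    printf("Running on: %s\n", name);
    return 0;
}
```

Because OpenCL compiles kernels at run time for whichever device you pick, this is also what gives you the "no recompile when vector widths grow" benefit described above.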
For two recent papers from my group detailing the performance portability results we've achieved across four different real applications with OpenCL, see:
"On the performance portability of structured grid codes on many-core computer architectures", S.N. McIntosh-Smith, M. Boulton, D. Curran and J.R. Price. ISC, Leipzig, June 2014. DOI: 10.1007/978-3-319-07518-1_4
"High Performance in silico Virtual Drug Screening on Many-Core Processors", S. McIntosh-Smith, J. Price, R.B. Sessions, A.A. Ibarra, IJHPCA 2014. DOI: 10.1177/1094342014528252