The progressing unification of CPU and GPU hardware, as evidenced by AMD Kaveri with hUMA (heterogeneous Uniform Memory Access) and Intel 4th generation CPUs, should allow copy-free sharing of data between CPU and GPU. I would like to know whether the most recent OpenCL (or other GPGPU framework) implementations allow true copy-free sharing (no explicit or implicit data copying) of large data structures between code running on the CPU and the GPU.
CUDA and OpenCL offer two different interfaces for programming GPUs. OpenCL is an open standard that can be used to program CPUs, GPUs, and other devices from different vendors, while CUDA is specific to NVIDIA GPUs.
OpenCL™ (Open Computing Language) is a low-level API for heterogeneous computing that runs on CPUs, GPUs (including CUDA-capable NVIDIA GPUs), and other accelerators. Using the OpenCL API, developers can launch compute kernels, written in a restricted dialect of the C programming language, on a GPU.
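For illustration, such a kernel might look like the following OpenCL C sketch; the kernel and argument names are hypothetical, and each work-item handles one array element:

__kernel void vector_add(__global const float *a,
                         __global const float *b,
                         __global float *result)
{
    // Each work-item computes one element of the output
    size_t i = get_global_id(0);
    result[i] = a[i] + b[i];
}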
OpenCL can use CPUs as compute devices just as it can GPUs. A CPU has no dedicated local memory; the CPU's caches are used, just as in any normal CPU program.
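For reference, selecting a CPU as the compute device only changes the device-type flag passed to clGetDeviceIDs; the following is a minimal sketch (first platform, single device, error handling omitted):

// Pick the first available platform
cl_platform_id platform;
clGetPlatformIDs(1, &platform, NULL);
// Ask for a CPU device instead of a GPU
cl_device_id cpu_device;
clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &cpu_device, NULL);
// From here on the CPU is used like any other OpenCL device
cl_context context = clCreateContext(NULL, 1, &cpu_device, NULL, NULL, NULL);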
CPUs typically support only OpenCL 1.2. Supported GPUs include the NVIDIA GeForce 8600M GT, GeForce 8800 GT, GeForce 8800 GTS, GeForce 9400M, GeForce 9600M GT, GeForce GT 120 and GeForce GT 130, as well as the ATI Radeon 4850 and Radeon 4870, and likely more.
The ability to share data between host and device without any memory transfers has been available in OpenCL from version 1.0, via the CL_MEM_ALLOC_HOST_PTR flag. This flag allocates a buffer for the device, but ensures that it lies in memory that is also accessible by the host. The workflow for these 'zero-copy' transfers usually takes on this form:
// Allocate a device buffer using host-accessible memory
cl_mem d_buffer = clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR, size, NULL, &err);
// Get a host pointer for the buffer (blocking map for writing)
void *h_buffer = clEnqueueMapBuffer(queue, d_buffer, CL_TRUE, CL_MAP_WRITE,
                                    0, size, 0, NULL, NULL, &err);
// Write data into h_buffer from the host
...
// Unmap the memory buffer
clEnqueueUnmapMemObject(queue, d_buffer, h_buffer, 0, NULL, NULL);
// Do stuff with the buffer on the device
clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_buffer);
clEnqueueNDRangeKernel(queue, kernel, ...);
This will create a device buffer, write some data into it from the host, and then run a kernel using this buffer on the device. Because of the way that the buffer was allocated, this should not result in a memory transfer if the device and host have a unified memory system.
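Reading results back on the host follows the same pattern: map the buffer again, this time for reading. The sketch below assumes the kernel has finished (enforced here with clFinish) and that the buffer holds floats; variable names mirror the snippet above:

// Make sure the kernel has finished before touching the buffer on the host
clFinish(queue);
// Map the buffer for reading; on a unified-memory system this is again copy-free
float *results = (float *)clEnqueueMapBuffer(queue, d_buffer, CL_TRUE, CL_MAP_READ,
                                             0, size, 0, NULL, NULL, &err);
// Read the results on the host
...
// Release the mapping when done
clEnqueueUnmapMemObject(queue, d_buffer, results, 0, NULL, NULL);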
The above approach is limited to simple, flat data structures (1D arrays). If you are interested in working with something a little more complex such as linked-lists, trees or any other pointer-based data structures, you'll need to take advantage of the Shared Virtual Memory (SVM) feature in OpenCL 2.0. At the time of writing, AMD and Intel have both released some preview support for OpenCL 2.0 functionality, but I cannot vouch for their implementations of SVM.
The workflow for the SVM approach will be somewhat similar to the code listed above. In short, you will allocate a buffer using clSVMAlloc, which will return a pointer that is valid on both the host and device. You will use clEnqueueSVMMap and clEnqueueSVMUnmap to synchronise the data when you wish to access the buffer from the host, and clSetKernelArgSVMPointer to pass it to the device. The crucial difference between SVM and CL_MEM_ALLOC_HOST_PTR is that an SVM pointer can also be included inside another buffer passed to the device (e.g. inside a struct or pointed to by another pointer). This is what allows you to build complex pointer-based data structures that can be shared between the host and device.
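To make that concrete, here is a rough sketch of the coarse-grained SVM workflow for a two-node linked list. The Node struct and the kernel argument index are placeholders, error handling is omitted, and an OpenCL 2.0 context, queue and kernel are assumed to exist already:

// A node layout shared by host and device code
typedef struct Node {
    int value;
    struct Node *next;   // SVM pointer, meaningful on both host and device
} Node;

// Allocate the nodes from shared virtual memory (0 = default alignment)
Node *head = (Node *)clSVMAlloc(context, CL_MEM_READ_WRITE, sizeof(Node), 0);
Node *tail = (Node *)clSVMAlloc(context, CL_MEM_READ_WRITE, sizeof(Node), 0);

// Map the nodes before touching them on the host (required for coarse-grained SVM)
clEnqueueSVMMap(queue, CL_TRUE, CL_MAP_WRITE, head, sizeof(Node), 0, NULL, NULL);
clEnqueueSVMMap(queue, CL_TRUE, CL_MAP_WRITE, tail, sizeof(Node), 0, NULL, NULL);
head->value = 1; head->next = tail;   // embedded pointer remains valid on the device
tail->value = 2; tail->next = NULL;
clEnqueueSVMUnmap(queue, head, 0, NULL, NULL);
clEnqueueSVMUnmap(queue, tail, 0, NULL, NULL);

// Pass the head of the list directly as an SVM pointer
clSetKernelArgSVMPointer(kernel, 0, head);
// Tell the runtime about SVM allocations the kernel reaches indirectly (via head->next)
void *indirect_ptrs[] = { tail };
clSetKernelArgExecInfo(kernel, CL_KERNEL_EXEC_INFO_SVM_PTRS, sizeof(indirect_ptrs), indirect_ptrs);
clEnqueueNDRangeKernel(queue, kernel, ...);

// Free the shared allocations once the kernel is done
clSVMFree(context, head);
clSVMFree(context, tail);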