I have a vector on the host and I want to halve it and send it to the device. A benchmark shows that CL_MEM_ALLOC_HOST_PTR is faster than CL_MEM_USE_HOST_PTR and much faster than CL_MEM_COPY_HOST_PTR. Memory analysis on the device also doesn't show any difference in the size of the buffer created on the device. This differs from what the Khronos documentation for clCreateBuffer says about these flags. Does anyone know what's going on?
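For concreteness, the three allocation strategies being compared look roughly like this (a minimal sketch; `ctx`, `host_vec`, and `bytes` are placeholder names, not from the question):

```c
#include <CL/cl.h>

// Assumed to exist already: cl_context ctx, float *host_vec,
// size_t bytes (half the vector, as described in the question).
cl_int err;

// 1) Runtime allocates host-accessible memory itself.
cl_mem buf_alloc = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR,
                                  bytes, NULL, &err);

// 2) Runtime uses the application's own pointer as backing storage.
cl_mem buf_use = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR,
                                bytes, host_vec, &err);

// 3) Runtime allocates its own storage and copies host_vec into it at creation.
cl_mem buf_copy = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                 bytes, host_vec, &err);
```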
The answer by Pompei 2 is incorrect. The specification makes no guarantee as to where the memory is allocated, only how it is allocated. CL_MEM_ALLOC_HOST_PTR makes clCreateBuffer allocate the host-side memory for you; you can then map it into a host pointer using clEnqueueMapBuffer. CL_MEM_USE_HOST_PTR will cause the runtime to scoop up the data you give it into an OpenCL buffer.
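A minimal sketch of that CL_MEM_ALLOC_HOST_PTR plus clEnqueueMapBuffer flow (the `ctx`, `queue`, `host_vec`, and `bytes` names are assumptions for illustration, not from the answer):

```c
#include <CL/cl.h>
#include <string.h>

// Assumed to exist already: cl_context ctx, cl_command_queue queue,
// float *host_vec, size_t bytes.
cl_int err;

// Let the runtime allocate the host-side storage for us.
cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR,
                            bytes, NULL, &err);

// Map the buffer to get a host pointer into that runtime-owned allocation.
void *mapped = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                  0, bytes, 0, NULL, NULL, &err);

// Fill the mapped region with the data destined for the device...
memcpy(mapped, host_vec, bytes);

// ...then unmap so the runtime can make it visible to the device.
clEnqueueUnmapMemObject(queue, buf, mapped, 0, NULL, NULL);
```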
Pinned memory is achieved through the use of CL_MEM_ALLOC_HOST_PTR: because the runtime allocates the memory itself, it is free to allocate it however it sees fit, which typically means pinned (page-locked) memory.
All of this performance behaviour is implementation dependent. Reading section 3.1.1 more carefully shows that in one of the calls (with no CL_MEM flag) NVIDIA is able to preallocate a device-side buffer, whilst the other calls merely get the pinned data mapped into a host pointer ready for writing to the device.
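A sketch of that pinned-staging pattern, reusing the assumed names from the earlier sketch: a CL_MEM_ALLOC_HOST_PTR buffer acts as the pinned host-side staging area, and a plain device buffer receives the data via clEnqueueWriteBuffer.

```c
// Pinned host-side staging buffer and a plain device-side destination buffer.
cl_mem pinned     = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR, bytes, NULL, &err);
cl_mem device_buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY, bytes, NULL, &err);

// Map the pinned buffer and fill it on the host.
float *staging = (float *)clEnqueueMapBuffer(queue, pinned, CL_TRUE, CL_MAP_WRITE,
                                             0, bytes, 0, NULL, NULL, &err);
memcpy(staging, host_vec, bytes);

// Write from the pinned host pointer into the device-side buffer.
clEnqueueWriteBuffer(queue, device_buf, CL_TRUE, 0, bytes, staging, 0, NULL, NULL);

// Release the mapping once the transfer has been enqueued.
clEnqueueUnmapMemObject(queue, pinned, staging, 0, NULL, NULL);
```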