In OpenCL, are there any performance benefits to flagging buffers as READ_ONLY or WRITE_ONLY?
This kernel is what I often see (a is READ_ONLY and b is WRITE_ONLY):
__kernel void two_buffer_double(__global float* a, __global float* b)
{
    int i = get_global_id(0);
    b[i] = a[i] * 2;
}
This kernel seems better because it uses less global memory (a is READ_WRITE):
__kernel void one_buffer_double(__global float* a)
{
    int i = get_global_id(0);
    a[i] = a[i] * 2;
}
Do READ_ONLY and WRITE_ONLY flags just exist to help with debugging and catching errors?
Note that there are actually two kinds of these. You have CL_MEM_READ_ONLY, CL_MEM_WRITE_ONLY and CL_MEM_READ_WRITE when allocating your buffers, but then you also have __read_only, __write_only and __read_write to decorate the pointers in your kernel code with.
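To make the two places concrete, here is a minimal sketch of both (ctx is assumed to be an existing cl_context and n the element count; error handling omitted):
// host side: allocation-time flags
cl_int err;
cl_mem a = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  n * sizeof(float), NULL, &err); // device will only read
cl_mem b = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, n * sizeof(float), NULL, &err); // device will only write

// kernel side: access qualifiers on the pointers themselves
__kernel void two_buffer_double(__global __read_only const float* a,
    __global __write_only float* b)
{
    int i = get_global_id(0);
    b[i] = a[i] * 2;
}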
These could be used for both optimization and error checking. Let's look at the performance first. If a write-only buffer is encountered, the writes need not be cached (as in a write-through cache), saving more cache for the reads. This depends a lot on the GPU hardware, and at least NVIDIA hardware does have the instructions needed to actually implement this (the .cs and .lu modifiers); you can refer to their PTX ISA. I haven't seen any evidence of the compiler actually performing this optimization, e.g.:
__kernel void Memset4(__global __write_only unsigned int *p_dest,
    const unsigned int n_dword_num)
{
    unsigned int i = get_global_id(0);
    if(i < n_dword_num)
        p_dest[i] = 0; // this
}
gets compiled as:
st.global.u32 [%r10], %r11; // no cache operation specified
This makes sense, as CUDA does not have equivalents for those qualifiers, so the compiler most likely silently ignores them. But it does not hurt to put them there; we might get luckier in the future. In CUDA, some of this functionality is exposed using the __ldg function and by using compiler flags to opt in or out of caching global memory transfers in L1 (-Xptxas -dlcm=cg). You can also always use asm if you find that bypassing the cache yields a major advantage.
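If you want to check what your own compiler emits, you can dump the program binary, which on NVIDIA's OpenCL implementation is the PTX text itself (a minimal sketch, assuming program is an already-built cl_program for a single device):
// query the binary size, then fetch the binary (PTX text on NVIDIA)
size_t n_binary_size;
clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES, sizeof(n_binary_size), &n_binary_size, NULL);
unsigned char *p_binary = (unsigned char*)malloc(n_binary_size);
unsigned char *p_binaries[1] = { p_binary };
clGetProgramInfo(program, CL_PROGRAM_BINARIES, sizeof(p_binaries), p_binaries, NULL);
fwrite(p_binary, 1, n_binary_size, stdout); // look for cache operators on ld.global / st.global
free(p_binary);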
As for error checking, writing to a read-only buffer is readily avoided using the const specifier in the kernel declaration. Disallowing reading from a write-only buffer is not possible in pure "C".
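For example, with const on the read-only argument, an accidental write is caught at kernel compile time (a minimal sketch):
__kernel void two_buffer_double(__global const float* a, __global float* b)
{
    int i = get_global_id(0);
    // a[i] = 0; // would not compile: a points to const memory
    b[i] = a[i] * 2;
}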
Another possible optimization happens when mapping those buffers to host memory. When mapping a CL_MEM_READ_ONLY buffer, the mapped region can be left uninitialized, as the host will only write to that memory for the device to read. Similarly, when unmapping a CL_MEM_WRITE_ONLY buffer, the driver does not need to copy the (potentially host-modified) contents from host memory back to device memory. I did not measure this.
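A minimal sketch of that mapping pattern (queue and the buffer a from above are assumed; CL_MAP_WRITE_INVALIDATE_REGION from OpenCL 1.2 is the explicit way to tell the driver the host will only write to the mapped region):
cl_int err;
float *p = (float*)clEnqueueMapBuffer(queue, a, CL_TRUE, CL_MAP_WRITE_INVALIDATE_REGION,
    0, n * sizeof(float), 0, NULL, NULL, &err);
for(size_t i = 0; i < n; ++ i)
    p[i] = (float)i; // host fills the buffer the device will only read
clEnqueueUnmapMemObject(queue, a, p, 0, NULL, NULL);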
As a side note, I have tried using:
inline unsigned int n_StreamingLoad(__global __read_only const unsigned int *p_src)
{
#ifdef NVIDIA
    unsigned int n_result;
    asm("ld.global.cs.u32 %r0, [%r1];" : "=r" (n_result) : "r" (p_src));
    return n_result;
#else // NVIDIA
    return *p_src; // generic
#endif // NVIDIA
}

inline void StreamingWrite(__global __write_only unsigned int *p_dest, const unsigned int n_value)
{
#ifdef NVIDIA
    asm("st.global.cs.u32 [%r0], %r1;" : : "r" (p_dest), "r" (n_value) : "memory");
#else // NVIDIA
    *p_dest = n_value; // generic
#endif // NVIDIA
}
which gives you about 15 extra GB/sec even on a simple memcpy kernel with sm_35 devices (tested on a GTX 780 and a K40). I haven't seen a noticeable speedup on sm_30 (not sure if it's even meant to be supported there, although the instructions are not being stripped from the PTX). Note that you need to define NVIDIA yourself (or see Detect OpenCL device vendor in kernel code).
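For instance, you can pass the define through the program build options (b_is_nvidia, program and device here stand for whatever vendor check and setup you already have):
const char *p_options = b_is_nvidia ? "-DNVIDIA" : "";
cl_int err = clBuildProgram(program, 1, &device, p_options, NULL, NULL);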