Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

OpenCL - Why Use READ_ONLY or WRITE_ONLY Buffers

In OpenCL, are there any performance benefits to flagging buffers as READ_ONLY or WRITE_ONLY?

This kernel is what I often see (a is READ_ONLY and b is WRITE_ONLY):

__kernel void two_buffer_double(__global float* a, __global float* b)
{
    int i = get_global_id(0);
    b[i] = a[i] * 2;
}

This kernel seems better, because it uses less global memory (a is READ_WRITE):

__kernel void one_buffer_double(__global float* a)
{
    int i = get_global_id(0);
    a[i] = a[i] * 2;
}

Do READ_ONLY and WRITE_ONLY flags just exist to help with debugging and catching errors?

like image 688
benshope Avatar asked Jul 27 '13 19:07

benshope


1 Answers

Note that there are actually two kinds of those. You have CL_MEM_READ_ONLY, CL_MEM_WRITE_ONLY and CL_MEM_READ_WRITE when allocating your buffers but then you also have __read_only, __write_only and __read_write to decorate your pointers in the kernel code with.

These could be used for both optimization and error checking. Lets look at the performance first. If a write-only buffer is encountered, the writes need not be cached (as in write through cache), saving more cache for the reads. This depends on the GPU hardware a lot and at least NVIDIA hardware does have the instructions needed to actually implement this (the .cs and .lu modifiers). You can refer to their PTX ISA. I haven't seen any evidence of the compiler actually performing this optimization, e.g.:

__kernel void Memset4(__global __write_only unsigned int *p_dest,
    const unsigned int n_dword_num)
{
    unsigned int i = get_global_id(0);
    if(i < n_dword_num)
        p_dest[i] = 0; // this
}

gets compiled as:

st.global.u32 [%r10], %r11; // no cache operation specified

This makes sense as CUDA does not have equivalents for those qualifiers so the compiler most likely silently ignores those. But it does not hurt to put them there, we might get luckier in the future. In CUDA, some of this functionality is exposed using the __ldg function and by using compiler flags to opt in/out of caching the global memory transfers in L1 (-Xptxas -dlcm=cg). You can also always use asm if you find that bypassing cache yields a major advantage.

As for error checking, writing to a read-only buffer is readily avoided using the const specifier in the kernel declaration. Disallowing reading from a write-only buffer is not possible in pure "C".

Another possible optimization happens when mapping those buffers to host memory. When mapping a CL_MEM_READ_ONLY buffer, the mapped region could be left uninitialized as the host will only write to that memory, for the device to only read it. Similarly, when un-mapping a CL_MEM_WRITE_ONLY buffer, the driver does not need to copy the (potentially modified by the host) contents from the host memory to the device memory. I did not measure this.

As a side note, I have tried using:

inline unsigned int n_StreamingLoad(__global __read_only const unsigned int *p_src)
{
#ifdef NVIDIA
    unsigned int n_result;
    asm("ld.global.cs.u32 %r0, [%r1];" : "=r" (n_result) : "r" (p_src));
    return n_result;
#else // NVIDIA
    return *p_src; // generic
#endif // NVIDIA
}

inline void StreamingWrite(__global __write_only unsigned int *p_dest, const unsigned int n_value)
{
#ifdef NVIDIA
    asm("st.global.cs.u32 [%r0], %r1;" : : "r" (p_dest), "r" (n_value) : "memory");
#else // NVIDIA
    *p_dest = n_value; // generic
#endif // NVIDIA
}

which gives you about 15 extra GB/sec even on a simple memcpy kernel with sm_35 devices (tested on GTX 780 and K40). Haven't seen noticeable speedup on sm_30 (not sure if its even meant to be supported there - although the instructions are not being stripped from ptx). Note that you need to define NVIDIA yourself (or see Detect OpenCL device vendor in kernel code).

like image 162
the swine Avatar answered Dec 16 '22 21:12

the swine