How do I use local memory in OpenCL?

Tags:

opencl

I've been playing with OpenCL recently, and I'm able to write simple kernels that use only global memory. Now I'd like to start using local memory, but I can't seem to figure out how to use get_local_size() and get_local_id() to compute one "chunk" of output at a time.

For example, let's say I wanted to convert Apple's OpenCL Hello World example kernel to something that uses local memory. How would you do it? Here's the original kernel source:

__kernel void square(
    __global float *input,
    __global float *output,
    const unsigned int count)
{
    int i = get_global_id(0);
    if (i < count)
        output[i] = input[i] * input[i];
}

If this example can't easily be converted into something that shows how to make use of local memory, any other simple example will do.

asked Mar 29 '10 23:03

splicer

3 Answers

Check out the samples in the NVIDIA or AMD SDKs; they should point you in the right direction. Matrix transpose, for example, uses local memory.

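Using your squaring kernel, you could stage the data in an intermediate local buffer. Remember to pass in the additional parameter from the host: for a __local argument, clSetKernelArg takes the buffer size in bytes and a NULL pointer.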

__kernel void square(
    __global float *input,
    __global float *output,
    __local float *temp,
    const unsigned int count)
{
    int gtid = get_global_id(0);
    int ltid = get_local_id(0);
    if (gtid < count)
    {
        temp[ltid] = input[gtid];
        // If work items read data written by other work items, we would
        // need a barrier(CLK_LOCAL_MEM_FENCE) between the write and the read.
        // Every work item in the group must reach it, so it could not sit inside this if.
        output[gtid] = temp[ltid] * temp[ltid];
    }
}
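
The host side of the kernel above has to pick a work-group size and tell OpenCL how many bytes of local memory to reserve. The following is a minimal sketch in plain C of that arithmetic only; the helper names `round_up_global_size` and `local_bytes_for`, the work-group size of 64, and the variables `kernel`, `queue` and `count` in the usage comment are assumptions for illustration, not part of the original answer.

```c
#include <stddef.h>

/* Hypothetical helper: smallest multiple of local_size that is >= count.
   With an explicit local size, OpenCL 1.x expects the global size to be a
   multiple of it; the kernel's `gtid < count` test masks the padding items. */
size_t round_up_global_size(size_t count, size_t local_size)
{
    return ((count + local_size - 1) / local_size) * local_size;
}

/* Hypothetical helper: bytes of local memory for one float per work item. */
size_t local_bytes_for(size_t local_size)
{
    return local_size * sizeof(float);
}

/* Usage on the host, assuming `kernel`, `queue` and `count` already exist
   and that 64 is a valid work-group size on the device:

   size_t local  = 64;
   size_t global = round_up_global_size(count, local);
   clSetKernelArg(kernel, 2, local_bytes_for(local), NULL);  // the __local temp
   clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, NULL);
*/
```

Note that `count` moves to argument index 3 once `temp` is inserted at index 2, so the existing clSetKernelArg call for it needs its index updated as well.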

answered Oct 03 '22 00:10

Tom

There is another way to do this if the size of the local buffer is a compile-time constant. Instead of passing a pointer in the kernel's parameter list, you can declare the buffer inside the kernel with the __local qualifier:

__local float localBuffer[1024];

This saves host code, because fewer clSetKernelArg calls are needed.

answered Oct 03 '22 00:10

Rick-Rainer Ludwig

In OpenCL, local memory is meant for sharing data among the work items of a work group. It usually requires a barrier call before the data can be used, for example when one work item reads a value that another work item wrote. Barriers are costly in hardware. Keep in mind that local memory pays off when data is read or written repeatedly, and that bank conflicts should be avoided as much as possible.

If you are not careful with local memory, you may sometimes end up with worse performance than with global memory alone.
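
To make the role of the barrier concrete, here is a rough sketch in plain C that emulates a single work group serially; it is not OpenCL code and not taken from any SDK. The function name `emulate_group_sum` and the group size of 8 are assumptions. Each loop over `lid` stands in for all work items executing one phase, and the end of each such loop marks where a real kernel would call barrier(CLK_LOCAL_MEM_FENCE). It also shows how the indices relate: with no global offset, get_global_id(0) equals get_group_id(0) * get_local_size(0) + get_local_id(0).

```c
#include <stddef.h>

#define GROUP_SIZE 8  /* assumed work-group size; must be a power of two here */

/* Serial emulation of ONE work group summing its GROUP_SIZE inputs. */
float emulate_group_sum(const float *input, size_t group_id)
{
    float temp[GROUP_SIZE];  /* stands in for a __local buffer */

    /* Phase 1: every work item stages one element from global memory. */
    for (size_t lid = 0; lid < GROUP_SIZE; ++lid) {
        size_t gid = group_id * GROUP_SIZE + lid;  /* global id, zero offset */
        temp[lid] = input[gid];
    }
    /* a real kernel needs a barrier here */

    /* Phase 2: tree reduction; each step reads values that OTHER work
       items wrote in the previous step, hence a barrier after every step. */
    for (size_t stride = GROUP_SIZE / 2; stride > 0; stride /= 2) {
        for (size_t lid = 0; lid < stride; ++lid)
            temp[lid] += temp[lid + stride];
        /* a real kernel needs a barrier here */
    }

    return temp[0];  /* work item 0 would write this to global memory */
}
```

Unlike the squaring kernel, every staged value here is read more than once and by a different work item than the one that wrote it, which is the situation where local memory and its barriers can be worth their cost.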

answered Oct 02 '22 23:10

Hunter Wang
