I'm wondering what the advantage of local memory is, since global memory can be accessed by each item separately and freely. Can't we just use global memory?
For example, say we have a 1000*1000 image and we want to add 1 to every pixel value. We can just use 1000*1000 items of global memory, right?
Would it be faster if we used local memory and split the 1000*1000 image into 100 parts of 100*100?
I'd really appreciate it if you could show me some simple code that uses local memory.
Can't we just use global memory?
Of course you can. First write actual working code. Then optimize.
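For the "add 1 to every pixel" case, the straightforward global-memory version is already close to optimal, since each pixel is touched exactly once. A minimal sketch (the kernel name and 8-bit pixel format are assumptions):

__kernel void addOne(__global uchar *image) // assuming 8-bit pixels
{
    // launch with global size = 1000*1000; each work item handles one pixel
    int i = get_global_id(0);
    image[i] = image[i] + 1;   // one coalesced read, one coalesced write
}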
Since global memory can be accessed by each item separately and freely
I'm not sure whether all architectures have a broadcasting ability, but I am sure that if memory is accessed randomly by all threads, it gets too slow. Ray tracing is an example: each pixel's ray refracts/reflects to different distances and different memory areas, which is a performance hit. If every thread accessed global memory in a uniform way, it would be much faster.
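To illustrate the difference, here is a hypothetical sketch of the two access patterns (kernel names are made up):

// Coalesced: adjacent work items read adjacent elements, so the hardware
// can combine them into a few wide memory transactions.
__kernel void coalescedCopy(__global const float *in, __global float *out)
{
    int i = get_global_id(0);
    out[i] = in[i];
}

// Non-uniform: each work item reads an element chosen by an index table
// (as a ray tracer effectively does); transactions cannot be combined.
__kernel void gatheredCopy(__global const float *in, __global const int *idx,
                           __global float *out)
{
    int i = get_global_id(0);
    out[i] = in[idx[i]];   // scattered reads, much slower
}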
We can just use 1000*1000 items of global memory, right?
There is a minimum value for the maximum buffer size, and it can be around 128 MB or 1/4 of device memory. The combined size of all buffers can vary with the platform/device, in the range of several GB.
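You can query these limits per device from the host. A sketch in C (error handling omitted; the header path is <OpenCL/cl.h> on macOS):

#include <stdio.h>
#include <CL/cl.h>

void printMemLimits(cl_device_id dev)
{
    cl_ulong maxAlloc, globalSize;
    // largest single buffer the device guarantees it can allocate
    clGetDeviceInfo(dev, CL_DEVICE_MAX_MEM_ALLOC_SIZE,
                    sizeof(maxAlloc), &maxAlloc, NULL);
    // total global memory available on the device
    clGetDeviceInfo(dev, CL_DEVICE_GLOBAL_MEM_SIZE,
                    sizeof(globalSize), &globalSize, NULL);
    printf("max single buffer: %llu MB\n", (unsigned long long)(maxAlloc >> 20));
    printf("total global mem : %llu MB\n", (unsigned long long)(globalSize >> 20));
}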
Would it be faster if we used local memory and split the 1000*1000 image into 100 parts of 100*100?
That depends on the data re-use ratio and the coalescedness of the access pattern. Random (non-coalesced) access to local memory is much faster than random (non-coalesced) access to global memory. If you use too much local memory or too large a private register file, it can even be slower, because higher local memory consumption leads to lower occupancy, less memory-latency hiding, and more register spilling to global memory. Try to balance it by using private registers too, or use a compression technique to fit more data into local memory.
If you re-use each piece of data, say, 256 times, then local memory access will be around 10-20x faster than global memory access.
Here is a very simple 2D n-body code for force calculations:
// Global memory is accessed only 257 times per item:
// 1 access for the private save,
// 256 accesses for the global-to-local (broadcast) copies.
// The unoptimized version accesses global memory 65537 times per item.
__kernel void nBodyF(__global float *x, __global float *y,
                     __global float *vx, __global float *vy, // velocities unused in this force-only kernel
                     __global float *fx, __global float *fy)
{
    int N = 65536;              // total number of masses for this example
    int LN = 256;               // length of each chunk in local memory,
                                // meaning 256 masses per work-group
    int i = get_global_id(0);   // global thread id: 0...65535
    int L = get_local_id(0);    // local thread id: 0...255 within each group
    float2 Fi = (float2)(0, 0);          // private force accumulator, init
    float xi = x[i]; float yi = y[i];    // re-used 65536 times
    __local float xL[256];               // local mem arrays with constant length
    __local float yL[256];
    for (int k = 0; k < N / LN; k++)     // number of chunks to fetch from global to local
    {
        barrier(CLK_LOCAL_MEM_FENCE);    // synchronization: previous chunk fully consumed
        xL[L] = x[k*LN + L];             // get 256-element chunks into local mem
        yL[L] = y[k*LN + L];
        barrier(CLK_LOCAL_MEM_FENCE);    // synchronization: chunk fully loaded
        for (int j = 0; j < LN; j++)     // start processing local/private variables
        {
            float2 r1 = (float2)(xi, yi);          // private vector
            float2 r2 = (float2)(xL[j], yL[j]);    // use local mem to get r2 vector
            float2 dr = r1 - r2;                   // private displacement
            float2 F = dr / (0.01f + dot(dr, dr)); // private (softened) force calc.
            Fi -= F;                               // add to private accumulator
        }
    }
    fx[i] = Fi.x; fy[i] = Fi.y; // write result to global mem only once
}
The example above is poor in terms of local memory re-use ratio, but half of the variables are in private memory and are re-used 64k times.
Worst-case scenario:
1) A big portion of the items cannot fit in the GPU cache.
2) Only global memory accesses are done.
3) Data is not re-used.
4) Memory is accessed in a very non-uniform way.
This will make it very slow.
When the data doesn't fit in cache and isn't re-used, you should use __read_only for the necessary buffers (__write_only for the buffers you write).
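As a sketch of what that looks like: in standard OpenCL C the __read_only/__write_only qualifiers apply to image objects, while for plain buffers the equivalent read-only hints are const and restrict (kernel names are made up):

// Image access qualifiers tell the implementation how each image is used.
__kernel void copyImage(__read_only image2d_t src, __write_only image2d_t dst,
                        sampler_t smp)
{
    int2 p = (int2)(get_global_id(0), get_global_id(1));
    write_imagef(dst, p, read_imagef(src, smp, p));
}

// For buffers, const + restrict hint that the data is read-only and unaliased.
__kernel void scale(__global const float * restrict in, __global float *out)
{
    int i = get_global_id(0);
    out[i] = in[i] * 2.0f;
}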
If you implement a convolution (or some anti-aliasing, or edge detection), the data re-use ratio will be between 4 and 20, and local memory optimization gives at least 3-4x performance.
If your GPU has 300 GB/s of global memory bandwidth, its local memory bandwidth would be around 3-4 TB/s. You can optimize for private registers too! Then it could be 15-20 TB/s, but that approach has fewer usage areas.
Edit: If you are reading single bytes, and these bytes differ from each other by only a small value (e.g. at most 16), then you can pack multiple values into single bytes as deltas from a reference value and unpack them in local memory. Each stored delta is the difference from the previous unpacked value. Example:
Global memory (copied to local mem):

Reference_byte   Byte0    Byte1     Byte2     Byte3
128              +3,-5    +24,+50   -25,-63   0,+2

Unpacked in local memory:

Reference_byte   Byte0  Byte1  Byte2  Byte3  Byte4  Byte5  Byte6  Byte7
128              131    126    150    200    175    112    112    114

Computing results on the array:

Reference_byte   Byte0  Byte1  Byte2  Byte3  Byte4  Byte5  Byte6  Byte7
128              120    130    140    150    150    150    100    110

Packing results in local memory:

Reference_byte   Byte0    Byte1     Byte2   Byte3
128              -8,+10   +10,+10   0,0     -50,+10

Global memory (copied from local mem):

Reference_byte   Byte0    Byte1     Byte2   Byte3
128              -8,+10   +10,+10   0,0     -50,+10
// Maybe a coordinate compression for voxel rendering.
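A minimal sketch of the unpack step, assuming two signed 4-bit deltas packed per byte (so each delta must fit in -8...+7; the wider deltas in the table above would need a larger field). The kernel name and layout are assumptions:

// Unpack two 4-bit deltas per byte into a running sum from a reference value.
// Hypothetical layout: high nibble = first delta, low nibble = second delta.
// One work item per stream for clarity; real code would parallelize this and
// stage the packed bytes in __local memory first.
__kernel void unpackDeltas(__global const uchar *src, __global uchar *out,
                           uchar reference, int nPacked)
{
    int acc = reference;
    for (int k = 0; k < nPacked; k++)
    {
        char hi = (char)src[k] >> 4;            // arithmetic shift sign-extends
        char lo = (char)(src[k] << 4) >> 4;     // sign-extend the low nibble
        acc += hi; out[2*k]     = (uchar)acc;
        acc += lo; out[2*k + 1] = (uchar)acc;
    }
}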
Use a profiler that gives you cache line usage info.
TL;DR: Local memory is a lot faster. Use it when you need to access the data more than once, or want to share data between work items in the same work group.
Local memory is usually located inside the processor itself, and runs at or near the clock speed of the chip. When you use a CPU for OpenCL, local data structures actually map to the cache memory. Because local memory uses up many of the on-chip transistors, it is expensive to have a lot of it available. Today's CPUs utilize up to three levels of cache memory, and still only offer from 8 to 24 MB (sometimes more in specialized processors). The same concept holds true for video cards: local memory is limited, lives inside the GPU, and runs extremely fast.
Global memory exists as memory chips external to the CPU/GPU and is connected to the processor via a memory controller. Modern systems have the memory controller built into the processor, but in the past it could be an additional chip on the computer's motherboard. The memory controller is the main reason why global memory has to be clocked much slower than local (cache) memory: every piece of data requested by any core (or OpenCL compute unit) has to be queued and transmitted to the processor, and signal integrity limits how fast the memory and its controller can be clocked.
Global memory is inexpensive compared to local memory, though. We can buy 8 GB of system memory today for well under $100. Graphics cards usually don't come with less than 1 GB of global memory, and 2-4 GB is becoming common.
Local memory is an order of magnitude faster and 2 to 3 orders-of-magnitude more expensive than global memory. (the same can be said when comparing RAM to hard disk capacity, speed, and cost)
When should you use local memory? The greatest benefits are seen when your algorithm requires repeated reads from the same memory address. Because local memory is so much faster than global memory, even a single reuse of the data can yield an improvement. The more times you re-read the memory, the less you rely on your global memory controller's performance, and your arithmetic logic units can be crunching data for more of their clock cycles (known as saturation of the ALUs).
You also need to use local memory if your algorithm needs to communicate between work items in the same group. This is achieved in OpenCL with local reads/writes and barriers. It can be done through global memory too, but only at a fraction of the speed that local memory provides. Think of it this way: the work items in a group need to share a simple data structure, maybe a float you want to accumulate a sum into. You can either 1) store it in a local, work-group-shared memory location within the GPU/CPU and synchronize the subsequent reads of the variable, or 2) send the 32-bit float to the global memory controller, queue it for write, write the value to global memory, then request the data back through the memory controller for another work item.
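A minimal sketch of option 1, summing one float per work item through local memory (the kernel name and the work-group size of 256 are assumptions):

// Each work item deposits one value into local memory; after a barrier,
// work item 0 sums them and performs the group's only global write.
__kernel void groupSum(__global const float *in, __global float *partialSums)
{
    __local float scratch[256];            // assumes work-group size 256
    int L = get_local_id(0);
    scratch[L] = in[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);          // all values visible to the group
    if (L == 0)
    {
        float sum = 0.0f;
        for (int j = 0; j < 256; j++) sum += scratch[j];
        partialSums[get_group_id(0)] = sum;
    }
}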
Just a couple of examples of algorithms that benefit from local, work-group-shared memory are matrix-matrix multiplication, vector-matrix multiplication, n-body systems, some image filters, and sorting small sets that fit entirely in local memory.
For your specific example of breaking 1000^2 pixels into 100 chunks of 100^2, there would be no benefit: the values are only read once, modified, and written once. There would be a benefit if you were performing a filter such as a blur, and had to read the surrounding pixels to compute a given pixel's new value. You just have to be careful that the fragmented data structures can fit into the local memory of your target OpenCL devices when deciding how small to make those fragments.
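To make the blur case concrete, here is a sketch of a 3x3 box blur that stages an 18x18 tile (a 16x16 work-group plus a 1-pixel halo) in local memory, so each pixel fetched from global memory is re-used roughly 9 times. All names and the work-group size are assumptions:

#define TS 16   // work-group is TS x TS

__kernel void blur3x3(__global const uchar *in, __global uchar *out,
                      int width, int height)
{
    __local uchar tile[TS + 2][TS + 2];
    int gx = get_global_id(0), gy = get_global_id(1);
    int lx = get_local_id(0),  ly = get_local_id(1);

    // cooperative load of the tile plus halo, clamped at the image borders
    for (int i = ly; i < TS + 2; i += TS)
        for (int j = lx; j < TS + 2; j += TS)
        {
            int ix = clamp((int)get_group_id(0) * TS + j - 1, 0, width  - 1);
            int iy = clamp((int)get_group_id(1) * TS + i - 1, 0, height - 1);
            tile[i][j] = in[iy * width + ix];
        }
    barrier(CLK_LOCAL_MEM_FENCE);

    if (gx < width && gy < height)
    {
        int sum = 0;
        for (int dy = -1; dy <= 1; dy++)      // all 9 taps read local memory,
            for (int dx = -1; dx <= 1; dx++)  // not global memory
                sum += tile[ly + 1 + dy][lx + 1 + dx];
        out[gy * width + gx] = (uchar)(sum / 9);
    }
}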
1) Why use local memory: global memory is large but slow in terms of data retrieval/access, whereas local memory is much smaller but very fast compared to global memory.
In simple terms, local memory acts as a cache for global memory if used correctly (practically it is not a cache: the programmer manages it explicitly).
2) There is no restriction on the use of local memory; it is up to the application's needs and the programmer's satisfaction with the performance.
3) When to use local memory: a cache keeps highly re-used data close to the processor to avoid the high cost of repeated accesses. The same approach can be followed with local memory: if the threads in a work-group access the same data (or data segment) again and again, that data can be pulled into local memory, which drops the access cost from global to local for each re-use.
You can see a performance gain with simple matrix multiplication here.
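As a sketch of that idea (the names, the tile size, and the assumption that n is a multiple of the tile size are mine): each work-group stages blocks of A and B in local memory, so every element loaded from global memory is re-used TILE times.

#define TILE 16   // work-group is TILE x TILE

__kernel void matMul(__global const float *A, __global const float *B,
                     __global float *C, int n)   // C = A * B, all n x n
{
    __local float Asub[TILE][TILE];
    __local float Bsub[TILE][TILE];
    int row = get_global_id(1), col = get_global_id(0);
    int ly  = get_local_id(1),  lx  = get_local_id(0);
    float acc = 0.0f;
    for (int t = 0; t < n / TILE; t++)
    {
        Asub[ly][lx] = A[row * n + t * TILE + lx];      // coalesced block loads
        Bsub[ly][lx] = B[(t * TILE + ly) * n + col];
        barrier(CLK_LOCAL_MEM_FENCE);                   // block fully staged
        for (int k = 0; k < TILE; k++)
            acc += Asub[ly][k] * Bsub[k][lx];           // all reads hit local mem
        barrier(CLK_LOCAL_MEM_FENCE);                   // block fully consumed
    }
    C[row * n + col] = acc;
}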
Local memory has much higher bandwidth and lower latency than global memory. Therefore, if threads within the same workgroup share data in local memory, they can access it faster than if the sharing took place in global memory. The key thing to point out is that threads in the same workgroup must have data to share with each other; otherwise local memory has no benefit. In your example, each pixel is used only once, so local memory won't help there.
Two examples where local memory is useful are convolutions (e.g. Gaussian-blurring an image) and parallel reduction. In parallel reduction, threads generate intermediate results that need to be shared across the workgroup, and only the final result gets written to global memory by a single thread in that workgroup. With convolutions, each pixel gets re-used when calculating the blur for its neighbouring pixels, hence local memory can be used to store a patch of the image for re-use by neighbouring threads. Let me know in the comments which of these you'd like an example for, if either would be useful.
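For the reduction case, a common local-memory pattern looks like this (a sketch; it assumes a power-of-two work-group size, and the __local buffer is sized by the host via clSetKernelArg):

// Tree reduction in local memory: each step halves the number of active
// work items; only work item 0 writes the group's result to global memory.
__kernel void reduceSum(__global const float *in, __global float *out,
                        __local float *scratch)
{
    int L = get_local_id(0);
    scratch[L] = in[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);
    for (int s = get_local_size(0) / 2; s > 0; s >>= 1)
    {
        if (L < s) scratch[L] += scratch[L + s];
        barrier(CLK_LOCAL_MEM_FENCE);   // all adds of this step finished
    }
    if (L == 0) out[get_group_id(0)] = scratch[0];
}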