Is private memory slower than local memory?

1 Answer

Ashwin's answer is in the right direction but a little misleading.

OpenCL abstracts the address space of variables away from their physical storage, and there is not necessarily a 1:1 mapping between the two.

Consider OpenCL variables declared in the __private address space, which is the default for automatic non-pointer variables inside functions. The NVIDIA GPU implementation physically allocates these in registers as far as possible, spilling to off-chip memory only when register capacity is insufficient. This off-chip memory is called "local memory" in CUDA terminology (not to be confused with OpenCL's __local address space), and it has performance characteristics similar to the memory allocated for __global variables, which explains the performance penalty of register spilling. There is no physical "private memory" in this implementation, only a "private address space", which may be allocated on- or off-chip.

The performance hit is not a direct consequence of using the private address space (or "private memory"), which is typically allocated in high-performance memory. It occurs because, under this implementation, the variable was too large to fit in registers and was therefore spilled to off-chip memory.
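To make the distinction concrete, here is a minimal sketch of two kernels that both use only the private address space. The kernel names, the SCRATCH_LEN value, and the host-side shim (fake_gid and the stub get_global_id) are assumptions made for illustration. Whether the second kernel actually spills depends on the compiler and device, so treat the comments as what one would typically expect on an NVIDIA GPU, not as a guarantee.

```c
/* Illustrative sketch: kernel names, SCRATCH_LEN, and the host shim are
   assumptions made for this example, not taken from the question. */
#ifndef __OPENCL_VERSION__
/* Hypothetical host-side shim: lets the same source build as plain C so the
   logic can be checked without a GPU. An OpenCL compiler skips this part. */
#include <stddef.h>
#define __kernel
#define __global
#define __private
static size_t fake_gid; /* stand-in for the work-item index */
static size_t get_global_id(unsigned int dim) { (void)dim; return fake_gid; }
#endif

#define SCRATCH_LEN 1024

/* A few private scalars: these would normally be kept in registers. */
__kernel void small_private(__global const float *in, __global float *out)
{
    size_t gid = get_global_id(0);
    float a = in[gid];      /* __private by default */
    float b = a * 2.0f;     /* __private by default */
    out[gid] = a + b;
}

/* A 4 KB private array with a run-time index: still "private" in OpenCL
   terms, but likely too large for registers, so it may be placed in
   off-chip memory (CUDA "local" memory on NVIDIA hardware). */
__kernel void large_private(__global const float *in, __global float *out)
{
    size_t gid = get_global_id(0);
    __private float scratch[SCRATCH_LEN];
    for (int i = 0; i < SCRATCH_LEN; ++i)
        scratch[i] = in[gid] + (float)i;
    out[gid] = scratch[gid % SCRATCH_LEN];
}
```

Both kernels are identical from the point of view of the OpenCL address spaces; only the amount of private data differs. If you want to check what your implementation did, querying CL_KERNEL_PRIVATE_MEM_SIZE through clGetKernelWorkGroupInfo reports the private memory used per work-item for a built kernel.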

147

answered Oct 12 '22 23:10

James Beilby



Tags:

opencl

Megharaj
