
Is private memory slower than local memory?

Tags:

opencl

I was working on a kernel that made many global memory accesses per thread, so I copied the data to local memory, which gave a 40% speedup.

I wanted still more speedup, so I copied the data from local to private memory, but that degraded performance.

So is it correct to conclude that we should not use too much private memory because it may degrade performance?

Megharaj asked Mar 27 '12 08:03



1 Answer

Ashwin's answer is in the right direction but a little misleading.

OpenCL abstracts the address space of variables away from their physical storage, and there is not necessarily a 1:1 mapping between the two.

Consider OpenCL variables declared in the __private address space, which includes automatic non-pointer variables inside functions by default. The NVIDIA GPU implementation will physically allocate these in registers as far as possible, only spilling over to off-chip memory when there is insufficient register capacity. This particular off-chip memory is called "CUDA local" memory, and has similar performance characteristics to memory allocated for __global variables, which explains the performance penalty due to register spill-over. There is no such physical thing as "private memory" in this implementation, only a "private address space", which may be allocated on- or off-chip.

The performance hit is not a direct consequence of using the private address space (or "private memory"), which is typically allocated in high performance memory. It is because, under this implementation, the variable was too large to be allocated on high performance registers, and was therefore "spilled over" to off-chip memory.
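
To make the distinction concrete, here is a minimal sketch of two kernels that both use the __private address space. The kernel names, array sizes, and the host shim at the top are assumptions made up for illustration; they are not taken from the question. Whether a particular array really ends up in registers depends on the compiler and device, so the comments describe what is typical, not what is guaranteed.

```c
/* Sketch only. The shim lets the same file compile as plain C for a
   functional check; it is a hypothetical stand-in, not part of OpenCL.
   A real OpenCL C compiler defines __OPENCL_VERSION__ and skips it. */
#ifndef __OPENCL_VERSION__
#include <stddef.h>
#define __kernel
#define __global
#define __private
static size_t shim_gid; /* pretend work-item id */
static size_t get_global_id(unsigned dim) { (void)dim; return shim_gid; }
#endif

#define SMALL 4   /* assumed size: small enough to fit in registers */
#define LARGE 256 /* assumed size: likely too large for registers   */

/* Small private array with a fixed trip count. A compiler can unroll
   both loops, so acc[] is typically kept in registers. */
__kernel void sum_small(__global const float *in, __global float *out)
{
    size_t gid = get_global_id(0);
    __private float acc[SMALL];
    for (int i = 0; i < SMALL; ++i)
        acc[i] = in[gid * SMALL + i];
    float s = 0.0f;
    for (int i = 0; i < SMALL; ++i)
        s += acc[i];
    out[gid] = s;
}

/* Large private array read through a data-dependent index. Registers
   cannot be indexed at run time, so on NVIDIA hardware buf[] is likely
   to be placed in off-chip "CUDA local" memory, even though it is
   declared in the same __private address space as acc[] above. */
__kernel void sum_large(__global const float *in,
                        __global const int *idx,
                        __global float *out)
{
    size_t gid = get_global_id(0);
    __private float buf[LARGE];
    for (int i = 0; i < LARGE; ++i)
        buf[i] = in[gid * LARGE + i];
    float s = 0.0f;
    for (int i = 0; i < LARGE; ++i)
        s += buf[idx[i]];
    out[gid] = s;
}
```

Both kernels are "using private memory" in OpenCL terms; only the second is likely to pay the spill penalty described above. If a kernel looks like the second one, keeping the data in __local memory, as in the version that gave the 40% speedup, may well be the faster choice.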

James Beilby answered Oct 12 '22 23:10