I was working on a kernel in which each thread made many global memory accesses, so I copied the data to local memory, which gave a 40% speedup.
I wanted still more speedup, so I copied the data from local to private memory, which degraded performance.
So am I right in thinking that we must not use too much private memory, because it may degrade performance?
Size and bandwidth: Per-block shared memory is faster than global memory and constant memory, but slower than per-thread registers. On the K20, each block can use at most 48 KB of shared memory. Per-thread registers hold only a small amount of data, but are the fastest.
Local memory (in the CUDA sense, not the OpenCL __local address space): Resides in global memory and can be 150x slower than register or shared memory.
Private memory is memory allocated by VirtualAlloc and not suballocated either by the Heap Manager or the .NET runtime. It cannot be shared with other processes, is charged against the system commit limit, and typically contains application data.
Process private memory can be tracked through many tools such as Task Manager, Resource Monitor, and Sysinternals Process Explorer. The performance counter for this is \Process(*)\Private Bytes and the closest WMI property is Win32_Process.PageFileUsage.
Ashwin's answer is in the right direction but a little misleading.
OpenCL abstracts the address space of variables away from their physical storage, and there is not necessarily a 1:1 mapping between the two.
Consider OpenCL variables declared in the __private address space, which includes automatic non-pointer variables inside functions by default. The NVIDIA GPU implementation will physically allocate these in registers as far as possible, only spilling over to physical off-chip memory when there is insufficient register capacity. This particular off-chip memory is called "CUDA local" memory, and has similar performance characteristics to memory allocated for __global variables, which explains the performance penalty due to register spill-over. There is no such physical thing as "private memory" in this implementation, only a "private address space", which may be allocated on- or off-chip.
The performance hit is not a direct consequence of using the private address space (or "private memory"), which is typically allocated in high performance memory. It is because, under this implementation, the variable was too large to be allocated on high performance registers, and was therefore "spilled over" to off-chip memory.
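To make the distinction concrete, here is a minimal sketch of a kernel that keeps a small array in the private address space. The kernel name `smooth`, the `TAPS` constant, and the host-side shim (`fake_gid` and the stub `get_global_id`) are all hypothetical and only there for illustration; the shim just lets the same source build as plain C so the arithmetic can be checked without a device. Whether `window` actually ends up in registers is decided by the compiler, not by the source code.

```c
/* Sketch only. Under an OpenCL compiler the shim below is skipped;
 * under a plain C compiler it stubs out the OpenCL keywords so the
 * kernel body can be exercised on the host. */
#ifndef __OPENCL_VERSION__
#include <stddef.h>
#define __kernel
#define __global
static size_t fake_gid;  /* hypothetical stand-in for the work-item id */
static size_t get_global_id(unsigned dim) { (void)dim; return fake_gid; }
#endif

/* Kept small on purpose (an assumption for this sketch): a short array
 * with a fixed trip count is a plausible candidate for registers, while
 * a much larger one may be spilled to off-chip memory. */
#define TAPS 4

__kernel void smooth(__global const float *in, __global float *out)
{
    size_t gid = get_global_id(0);

    /* Automatic variables are in the __private address space by default. */
    float window[TAPS];
    for (int i = 0; i < TAPS; ++i)
        window[i] = in[gid + i];   /* caller must provide gid + TAPS elements */

    float sum = 0.0f;
    for (int i = 0; i < TAPS; ++i)
        sum += window[i];

    out[gid] = sum / TAPS;
}
```

If `TAPS` were raised to a few hundred, the source would still only say "private", but on the implementation described above the array would likely no longer fit in registers, and each access to `window` could then cost about as much as a global memory access. Inspecting the compiler's build log or generated code is the way to confirm what actually happened for a given kernel.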