How many memory latency cycles per memory access type in OpenCL/CUDA?

Tags: memory, cuda, nvidia, latency, opencl

I looked through the programming guide and the best practices guide, and they mention that global memory access takes 400-600 cycles. I did not see much on the other memory types, such as texture cache, constant cache, and shared memory. Registers have 0 memory latency.

I think constant cache is the same as registers if all threads use the same address in constant cache. I am not so sure about the worst case.

Is shared memory the same as registers as long as there are no bank conflicts? If there are conflicts, how does the latency change?

What about texture cache?

asked Nov 04 '10 14:11

smuggledPancakes

2 Answers

For (Kepler) Tesla K20 the latencies are as follows:

Global memory: 440 clocks
Constant memory
    L1: 48 clocks
    L2: 120 clocks
Shared memory: 48 clocks
Texture memory
    L1: 108 clocks
    L2: 240 clocks

How do I know? I ran the microbenchmarks described by the authors of "Demystifying GPU Microarchitecture through Microbenchmarking". They provide similar results for the older GTX 280.

This was measured on a Linux cluster. The compute node where I ran the benchmarks was not being used by any other users and was not running any other processes. It runs BULLX Linux with a pair of 8-core Xeons and 64 GB of RAM, and I compiled with nvcc 6.5.12, changing sm_20 to sm_35.

There is also a chapter on operand costs in the PTX ISA documentation, although it is not very helpful: it just reiterates what you already expect, without giving precise figures.
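For readers who want a feel for how such latency microbenchmarks work, here is a minimal sketch of the pointer-chasing idea. It is not the code from the paper or the code I ran; the names `chase_kernel` and `average_load_latency` are made up for illustration. A single thread follows a chain of dependent loads and times them with `clock64()`. The result includes loop overhead, and the compiler is free to reorder code around the clock reads, so treat the number as a rough estimate and check the generated SASS before trusting it.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One thread walks a dependent chain: the address of each load comes from the
// previous load, so the loads cannot overlap with each other.
__global__ void chase_kernel(const unsigned int *chain, int iters,
                             long long *cycles, unsigned int *sink)
{
    unsigned int j = 0;
    long long start = clock64();
    for (int i = 0; i < iters; ++i) {
        j = chain[j];
    }
    long long stop = clock64();
    *cycles = stop - start;   // includes loop overhead
    *sink = j;                // store the result so the loads are not removed
}

// Hypothetical host helper: average clock cycles per dependent load, or -1.0
// if any CUDA call fails (for example, when no GPU is present).
// Assumes num_elems > 0, stride >= 0 and iters > 0.
double average_load_latency(int num_elems, int stride, int iters)
{
    // chain[i] holds the index of the next element to visit.
    std::vector<unsigned int> host_chain(num_elems);
    for (int i = 0; i < num_elems; ++i) {
        host_chain[i] = static_cast<unsigned int>((i + stride) % num_elems);
    }

    unsigned int *d_chain = nullptr;
    unsigned int *d_sink = nullptr;
    long long *d_cycles = nullptr;
    long long cycles = 0;
    size_t bytes = host_chain.size() * sizeof(unsigned int);

    bool ok = cudaMalloc(&d_chain, bytes) == cudaSuccess
           && cudaMalloc(&d_sink, sizeof(unsigned int)) == cudaSuccess
           && cudaMalloc(&d_cycles, sizeof(long long)) == cudaSuccess
           && cudaMemcpy(d_chain, host_chain.data(), bytes,
                         cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        chase_kernel<<<1, 1>>>(d_chain, iters, d_cycles, d_sink);
        // The blocking copy also waits for the kernel to finish.
        ok = cudaMemcpy(&cycles, d_cycles, sizeof(long long),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_chain);
    cudaFree(d_sink);
    cudaFree(d_cycles);
    return ok ? static_cast<double>(cycles) / iters : -1.0;
}
```

Calling something like `average_load_latency(1 << 20, 32, 4096)` from `main` and printing the result gives one data point. Which level of the memory hierarchy that number reflects depends on how the array size and stride compare with the cache sizes of your device, which is why this kind of benchmark is usually swept over several sizes and strides.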

answered Sep 19 '22 01:09

the swine

The latency to the shared/constant/texture memories is small, and the exact figure depends on which device you have. In general, though, GPUs are designed as a throughput architecture, which means that if you create enough threads, the latency to the memories, including global memory, is hidden.

The reason the guides talk about the latency to global memory is that it is much higher than that of the other memories, so it is the dominant latency to consider when optimizing.

You mentioned constant cache in particular. You are quite correct that if all threads within a warp (i.e. a group of 32 threads) access the same address, then there is no penalty: the value is read from the cache and broadcast to all threads simultaneously. However, if threads access different addresses, then the accesses must serialize, since the cache can only provide one value at a time. If you're using the CUDA Profiler, this will show up under the serialization counter.
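To make the two constant-cache cases concrete, here is a small sketch. The table `coeffs`, both kernels and the helper `run_constant_demo` are hypothetical names chosen for illustration. The first kernel has every thread read the same constant address (the broadcast case); the second has neighbouring threads read different addresses (the serialized case). The helper only checks that both produce correct results; to see the cost difference you would time the kernels or look at the profiler.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define TABLE_SIZE 64
__constant__ float coeffs[TABLE_SIZE];   // hypothetical lookup table

// Uniform access: every thread reads coeffs[k], so a single cached value can
// be broadcast to the whole warp.
__global__ void uniform_read(float *out, int k, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) out[tid] = coeffs[k];
}

// Divergent access: neighbouring threads read different addresses, so the
// requests of one warp are served one address at a time.
__global__ void divergent_read(float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) out[tid] = coeffs[tid % TABLE_SIZE];
}

// Hypothetical host helper: returns 1 if both kernels produce the expected
// values, 0 on a mismatch, and -1 if a CUDA call fails.
int run_constant_demo()
{
    const int n = 256;
    const int k = 5;
    std::vector<float> table(TABLE_SIZE), uniform_out(n), divergent_out(n);
    for (int i = 0; i < TABLE_SIZE; ++i) table[i] = 2.0f * i;

    float *d_out = nullptr;
    bool ok = cudaMemcpyToSymbol(coeffs, table.data(),
                                 TABLE_SIZE * sizeof(float)) == cudaSuccess
           && cudaMalloc(&d_out, n * sizeof(float)) == cudaSuccess;
    if (ok) {
        uniform_read<<<n / 64, 64>>>(d_out, k, n);
        ok = cudaMemcpy(uniform_out.data(), d_out, n * sizeof(float),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    if (ok) {
        divergent_read<<<n / 64, 64>>>(d_out, n);
        ok = cudaMemcpy(divergent_out.data(), d_out, n * sizeof(float),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_out);
    if (!ok) return -1;

    for (int i = 0; i < n; ++i) {
        if (uniform_out[i] != table[k]) return 0;
        if (divergent_out[i] != table[i % TABLE_SIZE]) return 0;
    }
    return 1;
}
```

If a table is mostly indexed per thread, as in `divergent_read`, it may be worth trying shared or global memory for it instead and measuring which is faster on your device.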

Shared memory, unlike the constant cache, can provide much higher bandwidth. Check out the CUDA Optimization talk for more details, including an explanation of bank conflicts and their impact.
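As a rough illustration of what a bank conflict looks like in code, here is a sketch that assumes 32 banks of 4-byte words (the default on many recent NVIDIA GPUs, but check your device; some architectures have other bank modes). The kernel `strided_shared_read` and the helper `run_bank_demo` are hypothetical names. With a stride of 1, the 32 lanes of a warp touch 32 different banks. With a stride of 32, every lane touches a different address in the same bank, so under that assumption the accesses are serialized.

```cuda
#include <cuda_runtime.h>

#define WARP 32
#define TILE (WARP * WARP)

// Launch with a single warp (32 threads). Each lane reads one element of a
// shared array at index lane * stride.
//   stride = 1  -> lanes map to 32 different banks (conflict-free)
//   stride = 32 -> all lanes map to bank 0 with different addresses
// (assuming 32 banks of 4-byte words).
__global__ void strided_shared_read(float *out, int stride)
{
    __shared__ float tile[TILE];
    int lane = threadIdx.x;

    // Fill the tile cooperatively; this pattern itself is conflict-free.
    for (int i = lane; i < TILE; i += WARP) {
        tile[i] = static_cast<float>(i);
    }
    __syncthreads();

    out[lane] = tile[(lane * stride) % TILE];
}

// Hypothetical host helper: returns 1 if the kernel output is as expected,
// 0 on a mismatch, and -1 if a CUDA call fails. Assumes stride >= 0.
int run_bank_demo(int stride)
{
    float host_out[WARP];
    float *d_out = nullptr;

    bool ok = cudaMalloc(&d_out, sizeof(host_out)) == cudaSuccess;
    if (ok) {
        strided_shared_read<<<1, WARP>>>(d_out, stride);
        ok = cudaMemcpy(host_out, d_out, sizeof(host_out),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_out);
    if (!ok) return -1;

    for (int lane = 0; lane < WARP; ++lane) {
        if (host_out[lane] != static_cast<float>((lane * stride) % TILE)) {
            return 0;
        }
    }
    return 1;
}
```

Both strides return the same data; only the access pattern differs. A single read like this is too short to time meaningfully, so to observe the effect you would repeat the access in a loop or look at the shared memory conflict counters in the profiler.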

answered Sep 21 '22 01:09

Tom
