The CUDA profiler has two metrics called dram_read_transactions and gld_transactions. The CUDA profiler user guide says "gld_transactions" is the number of global memory load transactions, while "dram_read_transactions" is the number of device memory read transactions. I cannot tell the difference between these descriptions, because reading data means loading data and global memory is dram. But the profiling results for these two metrics are different. I tested with one kernel. For the same kernel with different thread settings, gld_transactions is always the same value, 33554432, and this value is stable. But for dram_read_transactions, two different thread settings lead to different values, roughly 4486636 and 4197096. By "roughly" I mean that these values are not stable: they change slightly from one execution to another. We can also see that dram_read_transactions is much smaller than gld_transactions. So my questions can be summarized here:
I think once we know the answer to question (1), questions (2) and (3) can be easily explained. So can anyone explain this? Thanks in advance.
A global load refers to a logical memory space. A dram read refers to a transaction on a physical resource. This statement of yours:
reading data means loading data and global memory is dram.
is either incorrect or glossing over important details.
Fundamentally, global loads are issued by instructions executed by a warp. The initial target of these loads will be L1 or L2 cache (usually). A global load, if satisfied by cache contents, will never become a dram read transaction. On the other hand, if the target of the global load is not in a cache, then it will become a dram read transaction (typically/usually).
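To make the caching point concrete, here is a minimal sketch of a kernel where most global loads should be satisfied by cache. The kernel name `reread`, the buffer sizes, and the launch configuration are all hypothetical choices for illustration, not taken from the question. The assumption is that a 16 KB input window fits in cache on your GPU, so the global load transaction count should be much larger than the dram read transaction count; I have not measured it, and the exact numbers will vary by architecture.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread issues `passes` global loads, all of which fall inside a
// small `window` of the input buffer. Every load is a global load
// instruction, but after the window has been brought into cache once,
// most of those loads should not need to go to DRAM.
__global__ void reread(const float* in, float* out, int window, int passes)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;
    for (int p = 0; p < passes; ++p) {
        acc += in[(tid + p) % window];
    }
    out[tid] = acc;  // one global store per thread keeps the loads live
}

int main()
{
    const int threads = 256, blocks = 1024;  // hypothetical launch config
    const int n = threads * blocks;
    const int window = 4096;                 // 4096 floats = 16 KB (assumed to fit in cache)
    const int passes = 64;

    float *in = nullptr, *out = nullptr;
    cudaMalloc(&in, window * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    cudaMemset(in, 0, window * sizeof(float));

    reread<<<blocks, threads>>>(in, out, window, passes);
    cudaError_t err = cudaDeviceSynchronize();
    printf("kernel status: %s\n", cudaGetErrorString(err));

    cudaFree(in);
    cudaFree(out);
    return err == cudaSuccess ? 0 : 1;
}
```

On a GPU and toolkit where nvprof supports these metrics, you could compare the two counters with `nvprof --metrics gld_transactions,dram_read_transactions ./reread` (assuming the binary is named `reread`). Increasing `window` well beyond the cache size should push the dram read count up while the number of load instructions issued stays the same.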
Furthermore, the global memory space is not the only memory space. There are other memory spaces, such as local. Transactions to "local" memory can also ultimately be serviced in a variety of ways, one of which would be actually triggering a dram read. Such a transaction would not show up in any "global" metric but would show up in the dram read transaction metric.
I find this diagram from the Nsight VSE documentation (and tool help), showing the logical and physical arrangement of memory on a GPU, helpful in understanding this. I have excerpted the chart here and highlighted in red the "links" that correspond to the metrics you identified:

This answer gives a more detailed decoding of the above diagram, for relevant metrics.