For perfectly coalesced accesses to an array of 4096 doubles, each 8 bytes, nvprof reports the following metrics on an NVIDIA Tesla V100:
global_load_requests: 128
gld_transactions: 1024
gld_transactions_per_request: 8.000000
I cannot find a specific definition of what a transaction and a request to global memory are exactly, so I am having trouble understanding these metrics. Therefore my questions:
Does gld_transactions_per_request = 8.000000 indicate perfectly coalesced accesses to doubles?
In an attempt to explain it to myself, this is what I have come up with:
A request is a warp-level load instruction: 32 threads * 8 bytes = 256 byte load. -- Is this correct?
A transaction is a 32 byte load instruction. In this scenario, one transaction defined in this way is able to load 32 bytes / 8 bytes = 4 doubles. -- Is this correct? If so, is this the largest load instruction CUDA implements?
Using these definitions, I arrive at the same values as nvprof: accessing 4096 array items requires 128 warp-level instructions (= requests) with 32 threads each. Using 32 byte loads (= transactions) results in 1024 transactions.
A memory "request" is an instruction which accesses memory, and a "transaction" is the movement of a unit of data between two regions of memory.
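To check the arithmetic from the question against these definitions, here is a minimal sketch. It assumes one request per warp-level load instruction and one transaction per 32 bytes moved, which is the question's working hypothesis rather than a documented definition. The function name `expected_load_metrics` and its parameters are made up for illustration, and the model only covers the fully coalesced, aligned case where every thread loads one element.

```python
WARP_SIZE = 32         # threads per warp
TRANSACTION_BYTES = 32 # assumed transaction size, taken from the question


def expected_load_metrics(num_elements, element_bytes,
                          warp_size=WARP_SIZE,
                          transaction_bytes=TRANSACTION_BYTES):
    """Estimate (requests, transactions, transactions per request)
    for a fully coalesced, aligned load of num_elements items,
    one item per thread."""
    # One request per warp-level load instruction (ceiling division).
    requests = -(-num_elements // warp_size)
    # One transaction per transaction_bytes of data moved.
    total_bytes = num_elements * element_bytes
    transactions = -(-total_bytes // transaction_bytes)
    return requests, transactions, transactions / requests


# 4096 doubles (8 bytes each), as in the question.
requests, transactions, ratio = expected_load_metrics(4096, 8)
print(requests, transactions, ratio)  # 128 1024 8.0
```

Under the same assumptions, 4096 floats (4 bytes each) would give 128 requests and 512 transactions, i.e. 4 transactions per request, so the ratio to expect for coalesced accesses depends on the element size.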