The CUDA profiler manual states that, because of the more relaxed coalescing policy, the number of uncoalesced memory transactions will always be zero. But I'm sure that uncoalesced accesses still happen. How can I count them? Are there any tools or simulators that can help? Among them, which one is the most accurate? Thanks
On devices of compute capability 1.0, you had only two options:
On devices of compute capability 1.2 and 1.3, however, this is done differently. Imagine your device memory divided into chunks of 128 bytes each. You need as many memory transactions as the number of chunks you hit. So:
There are so many cases that sorting them into just two categories, coalesced and uncoalesced, does not make sense anymore. That is why the CUDA Profiler takes a different approach: it simply counts the number of memory transactions. The more random your access pattern is, the higher the memory transaction count, even if the number of memory access instructions stays the same.
The above is a slightly simplified model. In reality, a memory transaction can access a 128-byte, 64-byte or 32-byte wide chunk, to save bandwidth. Look for the columns load 128b, load 64b, load 32b, and store 128b, store 64b, store 32b in your profiler.
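To get a rough feel for the simplified 128-byte chunk model, you can count the distinct chunks touched by a group of threads on the host. The sketch below is only an estimate, not what the hardware or the profiler actually does: the function name `count_transactions` is made up for illustration, the grouping of 16 threads (a half-warp) and the 4-byte word size are assumptions, and the narrowing to 64-byte or 32-byte transactions is ignored.

```python
SEGMENT_BYTES = 128  # chunk size used by the simplified model
HALF_WARP = 16       # assumption: accesses are grouped per 16 threads
WORD_BYTES = 4       # assumption: each thread reads one 4-byte word


def count_transactions(addresses, segment_bytes=SEGMENT_BYTES):
    """Count the distinct chunks hit by a list of byte addresses.

    Hypothetical helper: one transaction is charged per chunk touched.
    """
    return len({addr // segment_bytes for addr in addresses})


# Thread i reads word i, starting at a chunk boundary.
sequential = [tid * WORD_BYTES for tid in range(HALF_WARP)]

# Same pattern, but the base address is shifted by 96 bytes,
# so the 64 bytes read straddle two chunks.
misaligned = [96 + tid * WORD_BYTES for tid in range(HALF_WARP)]

# Each thread reads from a different chunk (stride of 128 bytes).
strided = [tid * SEGMENT_BYTES for tid in range(HALF_WARP)]

print(count_transactions(sequential))  # 1
print(count_transactions(misaligned))  # 2
print(count_transactions(strided))     # 16
```

All three patterns issue the same number of load instructions, yet the estimated transaction count ranges from 1 to 16, which is the kind of difference the profiler's transaction counters are meant to expose.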