I've read that CUDA can read from global memory 128 bytes at a time, so it makes sense that each thread in a warp can read/write 4 bytes in a coalesced pattern for a total of 128 bytes.
Reading/writing with the vector types like int4 and float4 is faster.
But what I don't understand is why this is faster. If each thread in the warp is requesting 16 bytes, and only 128 bytes can move across the bus at a time, where does the performance gain come from?
Is it because there are fewer memory requests happening, i.e. the hardware is told "grab 16 bytes for each thread in this warp" once, as opposed to "grab 4 bytes for each thread in this warp" four times? I can't find anything in the literature that states the exact reason why the vector types are faster.
Vectorized loads are a fundamental CUDA optimization that you should use when possible, because they increase bandwidth, reduce instruction count, and reduce latency. The Parallel Forall post linked below shows how you can easily incorporate vectorized loads into existing kernels with relatively few changes.
The easiest way to use vectorized loads is to use the vector data types defined in the CUDA C/C++ standard headers, such as int2, int4, or float2. You can easily use these types via type casting in C/C++. For example, in C++ you can recast the int pointer d_in to an int2 pointer using reinterpret_cast<int2*>(d_in). In C99 you can do the same thing using the casting operator: (int2*)d_in. Dereferencing those pointers will cause the compiler to generate the vectorized instructions.
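Here is a minimal sketch of that pattern (the kernel name, the grid-stride loop, and the odd-element handling are illustrative, not lifted verbatim from the post). Note that d_in and d_out must be 8-byte aligned for the int2 accesses to be valid:

// Sketch: copy N ints using 8-byte (int2) loads and stores.
// Assumes d_in and d_out are aligned to 8 bytes.
__global__ void copy_int2(int *d_out, const int *d_in, int N) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;

  // Reinterpret the int pointers as int2 pointers: each dereference
  // then compiles to a single 8-byte vectorized load or store.
  int2 *out2 = reinterpret_cast<int2 *>(d_out);
  const int2 *in2 = reinterpret_cast<const int2 *>(d_in);

  for (int i = idx; i < N / 2; i += stride)
    out2[i] = in2[i];

  // If N is odd, one element remains; copy it with a scalar access.
  if (idx == 0 && (N % 2) != 0)
    d_out[N - 1] = d_in[N - 1];
}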
Your last paragraph is basically the answer to your question. The performance improvement comes from efficiency gains, in two ways.
First, at the instruction level within the multiprocessor: one vector load or store instruction replaces several scalar ones, so there are fewer instructions to fetch, issue, and track, and fewer address calculations, for the same number of bytes moved.
Second, at the memory system: the requests generated are fewer and larger, so there is less per-request overhead for the same amount of data transferred from global memory.
So you get efficiency gains both at the multiprocessor and at the memory controller by using vector memory instructions, as compared to issuing individual instructions that produce individual memory transactions to retrieve the same number of bytes from global memory.
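As a rough illustration of that accounting (a sketch, not taken from the answer; exact instruction counts depend on the architecture and compiler), compare two grid-stride copy kernels that move the same data:

// Both kernels copy the same bytes; the difference is how many
// load/store instructions and separate memory requests are needed.

// Scalar: 4-byte accesses. Moving 16 bytes per thread takes four loop
// iterations, i.e. four load and four store instructions, each of
// which produces its own warp-wide memory request.
__global__ void copy_scalar(float *out, const float *in, int n) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = idx; i < n; i += stride)
    out[i] = in[i];
}

// Vectorized: 16-byte accesses. The same 16 bytes per thread move with
// one load and one store instruction, so the SM issues (and the memory
// pipeline handles) roughly a quarter as many separate requests.
__global__ void copy_float4(float4 *out, const float4 *in, int n4) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = idx; i < n4; i += stride)
    out[i] = in[i];
}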
There is a thorough answer to this question on NVIDIA's Parallel Forall blog: http://devblogs.nvidia.com/parallelforall/cuda-pro-tip-increase-performance-with-vectorized-memory-access/
The main reason is less index arithmetic per byte loaded when vector loads are used.
There is another reason: more loads in flight, which helps saturate memory bandwidth in cases of low occupancy, as sketched below.
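Here is a sketch of that second point (the kernel name and the two-elements-per-thread layout are illustrative assumptions, not from the answer): each thread issues two independent 16-byte loads before using either result, so more bytes are in flight per warp even when few warps are resident.

// Sketch: two independent float4 loads per thread, both outstanding at
// once. Assumes the launch provides at least ceil(n4 / 2) threads total.
__global__ void copy_two_in_flight(float4 *out, const float4 *in, int n4) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;

  if (idx + stride < n4) {
    float4 a = in[idx];           // first 16-byte load issued
    float4 b = in[idx + stride];  // second load issued before 'a' is used
    out[idx] = a;
    out[idx + stride] = b;
  } else if (idx < n4) {
    out[idx] = in[idx];           // tail: only one element left for this thread
  }
}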