 

Why are CUDA vector types (int4, float4) faster?

Tags:

cuda

I've read that CUDA can read from global memory 128 bytes at a time, so it makes sense that each thread in a warp can read/write 4 bytes in a coalesced pattern for a total of 128 bytes.

Reading/writing with the vector types like int4 and float4 is faster.

But what I don't understand is why this is. If each thread in the warp is requesting 16 bytes, and only 128 bytes can move across the bus at a time, where does the performance gain come from?

Is it because there are fewer memory requests happening, i.e. it is saying "grab 16 bytes for each thread in this warp" once, as opposed to "grab 4 bytes for each thread in this warp" 4 times? I can't find anything in the literature that says exactly why the vector types are faster.

asked Jul 16 '15 by user13741

People also ask

What are vectorized loads in CUDA?

Vectorized loads are a fundamental CUDA optimization that you should use when possible, because they increase bandwidth, reduce instruction count, and reduce latency. In this post, I’ve shown how you can easily incorporate vectorized loads into existing kernels with relatively few changes.

How do I use vectorized loads in C++?

The easiest way to use vectorized loads is to use the vector data types defined in the CUDA C/C++ standard headers, such as int2, int4, or float2. You can easily use these types via type casting in C/C++. For example, in C++ you can recast the int pointer d_in to an int2 pointer using reinterpret_cast<int2*>(d_in).

How to recast a pointer to an Int2 in C++?

For example, in C++ you can recast the int pointer d_in to an int2 pointer using reinterpret_cast<int2*>(d_in). In C99 you can do the same thing with a cast: (int2*)(d_in). Dereferencing those pointers will cause the compiler to generate the vectorized instructions.
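
A minimal sketch of this casting approach, using int4 (the kernel and pointer names are illustrative; it assumes N is a multiple of 4 and that d_in and d_out are 16-byte aligned):

    __global__ void copy_int4(const int *d_in, int *d_out, int N)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N / 4) {
            // Reinterpreting the int pointers as int4 pointers makes the
            // compiler emit one 128-bit (16-byte) load/store per thread.
            const int4 *in4 = reinterpret_cast<const int4 *>(d_in);
            int4 *out4 = reinterpret_cast<int4 *>(d_out);
            out4[i] = in4[i];
        }
    }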


2 Answers

Your last paragraph is basically the answer to your question. The performance improvement comes from efficiency gains, in two ways:

  1. At the instruction level, a multi-word vector load or store requires only a single instruction to be issued, so the bytes-per-instruction ratio is higher and the total instruction latency for a given memory transaction is lower.
  2. At the memory controller level, a vector-sized transaction request from a warp results in larger net memory throughput per transaction, so the bytes-per-transaction ratio is higher. Fewer transaction requests reduce memory controller contention and can produce higher overall memory bandwidth utilisation.

So you get efficiency gains at both the multiprocessor and the memory controller by using vector memory instructions, compared with issuing individual instructions that produce individual memory transactions to fetch the same number of bytes from global memory.
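
To make the two points concrete, here is a minimal sketch (kernel names are hypothetical; it assumes N is a multiple of 4 and 16-byte-aligned pointers) contrasting a scalar copy with its float4 counterpart. Per 16 bytes moved, the scalar version issues four load and four store instructions, while the vector version issues one of each:

    // Scalar copy: one 4-byte load and one 4-byte store per thread.
    __global__ void copy_scalar(const float *in, float *out, int N)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N)
            out[i] = in[i];
    }

    // Vectorized copy: one 16-byte load and one 16-byte store per thread,
    // so a quarter as many memory instructions move the same data.
    __global__ void copy_vec4(const float4 *in, float4 *out, int N)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N / 4)
            out[i] = in[i];
    }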

answered Oct 08 '22 by talonmies


There is a thorough answer to this question on NVIDIA's Parallel Forall blog: http://devblogs.nvidia.com/parallelforall/cuda-pro-tip-increase-performance-with-vectorized-memory-access/

The main reason is less index arithmetic per byte loaded when vector loads are used.

There is another reason: more loads in flight, which helps saturate memory bandwidth in cases of low occupancy.
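
As an illustration (the kernel name is hypothetical; it assumes N is a multiple of 4 and 16-byte-aligned pointers), a grid-stride copy using float4 performs the index arithmetic once per 16 bytes rather than once per 4 bytes, and keeps a wider load in flight per thread on each iteration:

    __global__ void copy_stride_vec4(const float *in, float *out, int N)
    {
        const float4 *in4 = reinterpret_cast<const float4 *>(in);
        float4 *out4 = reinterpret_cast<float4 *>(out);
        // One index computation covers 16 bytes instead of 4.
        for (int i = blockIdx.x * blockDim.x + threadIdx.x;
             i < N / 4;
             i += blockDim.x * gridDim.x)
            out4[i] = in4[i];
    }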

answered Oct 08 '22 by Maxim Milakov