I've read that CUDA can read from global memory 128 bytes at a time, so it makes sense that each thread in a warp can read/write 4 bytes in a coalesced pattern for a total of 128 bytes.
Reading/writing with the vector types like int4 and float4 is faster.
But what I don't understand is why this is faster. If each thread in the warp is requesting 16 bytes, and only 128 bytes can move across the bus at a time, where does the performance gain come from?
Is it because there are fewer memory requests happening, i.e. the hardware is told "grab 16 bytes for each thread in this warp" once, as opposed to "grab 4 bytes for each thread in this warp" four times? I can't find anything in the literature that states the exact reason why the vector types are faster.
Vectorized loads are a fundamental CUDA optimization that you should use when possible, because they increase bandwidth, reduce instruction count, and reduce latency. The Parallel Forall post linked below shows how you can easily incorporate vectorized loads into existing kernels with relatively few changes.
The easiest way to use vectorized loads is to use the vector data types defined in the CUDA C/C++ standard headers, such as int2, int4, or float2. You can easily use these types via type casting in C/C++. For example, in C++ you can recast the int pointer d_in to an int2 pointer using reinterpret_cast<int2*>(d_in). In C99 you can do the same thing using the casting operator: (int2*)d_in. Dereferencing those pointers will cause the compiler to generate the vectorized instructions.
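Here is a minimal sketch of that pattern (the kernel name, the grid-stride loop, and the odd-element handling are illustrative, not lifted verbatim from the post). Note that d_in and d_out must be 8-byte aligned for the int2 accesses to be valid:

// Sketch: copy N ints using 8-byte (int2) loads and stores.
// Assumes d_in and d_out are aligned to 8 bytes.
__global__ void copy_int2(int *d_out, const int *d_in, int N) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;

  // Reinterpret the int pointers as int2 pointers: each dereference
  // then compiles to a single 8-byte vectorized load or store.
  int2 *out2 = reinterpret_cast<int2 *>(d_out);
  const int2 *in2 = reinterpret_cast<const int2 *>(d_in);

  for (int i = idx; i < N / 2; i += stride)
    out2[i] = in2[i];

  // If N is odd, one element remains; copy it with a scalar access.
  if (idx == 0 && (N % 2) != 0)
    d_out[N - 1] = d_in[N - 1];
}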
Your last paragraph is basically the answer to your question. The performance improvement comes from efficiency gains, in two ways.
First, at the instruction level within the multiprocessor: one vector load or store instruction replaces several scalar ones, so there are fewer instructions to fetch, issue, and track, and fewer address calculations, for the same number of bytes moved.
Second, at the memory system: the requests generated are fewer and larger, so there is less per-request overhead for the same amount of data transferred from global memory.
So you get efficiency gains both at the multiprocessor and at the memory controller by using vector memory instructions, as compared to issuing individual instructions that produce individual memory transactions to retrieve the same number of bytes from global memory.
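As a rough illustration of that accounting (a sketch, not taken from the answer; exact instruction counts depend on the architecture and compiler), compare two grid-stride copy kernels that move the same data:

// Both kernels copy the same bytes; the difference is how many
// load/store instructions and separate memory requests are needed.

// Scalar: 4-byte accesses. Moving 16 bytes per thread takes four loop
// iterations, i.e. four load and four store instructions, each of
// which produces its own warp-wide memory request.
__global__ void copy_scalar(float *out, const float *in, int n) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = idx; i < n; i += stride)
    out[i] = in[i];
}

// Vectorized: 16-byte accesses. The same 16 bytes per thread move with
// one load and one store instruction, so the SM issues (and the memory
// pipeline handles) roughly a quarter as many separate requests.
__global__ void copy_float4(float4 *out, const float4 *in, int n4) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = idx; i < n4; i += stride)
    out[i] = in[i];
}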
There is a thorough answer to this question on NVIDIA's Parallel Forall blog: http://devblogs.nvidia.com/parallelforall/cuda-pro-tip-increase-performance-with-vectorized-memory-access/
The main reason is less index arithmetic per byte loaded when vector loads are used.
There is another reason: more loads in flight, which helps saturate memory bandwidth in cases of low occupancy, as sketched below.
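Here is a sketch of that second point (the kernel name and the two-elements-per-thread layout are illustrative assumptions, not from the answer): each thread issues two independent 16-byte loads before using either result, so more bytes are in flight per warp even when few warps are resident.

// Sketch: two independent float4 loads per thread, both outstanding at
// once. Assumes the launch provides at least ceil(n4 / 2) threads total.
__global__ void copy_two_in_flight(float4 *out, const float4 *in, int n4) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;

  if (idx + stride < n4) {
    float4 a = in[idx];           // first 16-byte load issued
    float4 b = in[idx + stride];  // second load issued before 'a' is used
    out[idx] = a;
    out[idx + stride] = b;
  } else if (idx < n4) {
    out[idx] = in[idx];           // tail: only one element left for this thread
  }
}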