
Are there advantages to using the CUDA vector types?

CUDA provides built-in vector data types like uint2, uint4 and so on. Are there any advantages to using these data types?

Let's assume that I have a tuple which consists of two values, A and B. One way to store them in memory is to allocate two arrays: the first array stores all the A values and the second array stores all the B values at the indexes that correspond to the A values. Another way is to allocate one array of type uint2. Which one should I use? Which way is recommended? Do the members of uint3, i.e. x, y, z, reside side by side in memory?
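A minimal sketch of the two layouts in question (the names here are just for illustration):

// Layout 1: structure of arrays (SoA) - two separate arrays,
// where a[i] and b[i] together form one tuple.
unsigned int *a;   // all the A values
unsigned int *b;   // all the B values

// Layout 2: array of structures (AoS) - one array of the built-in
// vector type, where ab[i].x holds A and ab[i].y holds B.
uint2 *ab;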

username_4567 asked Sep 09 '12



2 Answers

This is going to be a bit speculative but may add to @ArchaeaSoftware's answer.

I'm mainly familiar with Compute Capability 2.0 (Fermi). For this architecture, I don't think that there is any performance advantage to using the vectorized types, except maybe for 8- and 16-bit types.

Looking at the declaration for char4:

struct __device_builtin__ __align__(4) char4
{
    signed char x, y, z, w;
};

The type is aligned to 4 bytes. I don't know what __device_builtin__ does. Maybe it triggers some magic in the compiler...

Things look a bit strange for the declarations of float1, float2, float3 and float4:

struct __device_builtin__ float1
{
    float x;
};

__cuda_builtin_vector_align8(float2, float x; float y;);

struct __device_builtin__ float3
{
    float x, y, z;
};

struct __device_builtin__ __builtin_align__(16) float4
{
    float x, y, z, w;
};

float2 gets some form of special treatment. float3 is a struct without any alignment specifier, and float4 gets aligned to 16 bytes. I'm not sure what to make of that.
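If you want to verify the resulting sizes and alignments on your own toolchain, a quick host-side check (assuming nvcc with C++11 support; the numbers in the comments are what I would expect, not a guarantee) could look like this:

#include <cstdio>
#include <vector_types.h>   // defines float1..float4 (also pulled in by cuda_runtime.h)

int main()
{
    // Print size and alignment of each built-in float vector type.
    printf("float1: size %zu, align %zu\n", sizeof(float1), alignof(float1)); // expect 4, 4
    printf("float2: size %zu, align %zu\n", sizeof(float2), alignof(float2)); // expect 8, 8
    printf("float3: size %zu, align %zu\n", sizeof(float3), alignof(float3)); // expect 12, 4
    printf("float4: size %zu, align %zu\n", sizeof(float4), alignof(float4)); // expect 16, 16
    return 0;
}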

Global memory transactions are 128 bytes, aligned to 128 bytes. Transactions are always performed for a full warp at a time. When a warp reaches a function that performs a memory transaction, say a 32-bit load from global memory, the chip will at that time perform as many transactions as are necessary for servicing all the 32 threads in the warp. So, if all the accessed 32-bit values are within a single 128-byte line, only one transaction is necessary. If the values come from different 128-byte lines, multiple 128-byte transactions are performed. For each transaction, the warp is put on hold for around 600 cycles while the data is fetched from memory (unless it's in the L1 or L2 caches).
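To make the transaction mechanics concrete, here is a hypothetical kernel in which thread i loads the 32-bit value in[i]; with a properly aligned in pointer, the 32 loads of a warp fall within one 128-byte line, so they can be serviced by a single transaction:

// Illustration only: each warp touches 32 consecutive floats = 128 bytes,
// which (given proper alignment) maps to one global memory transaction.
__global__ void scale(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0f * in[i];
}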

So, I think the key to finding out what type of approach gives the best performance, is to consider which approach causes the fewest 128-byte memory transactions.

Assuming that the built-in vector types are just structs, some of which have special alignment, using the vector types causes the values to be stored in an interleaved way in memory (array of structs). So, if the warp is loading all the x values at that point, the other values (y, z, w) will be pulled into L1 because of the 128-byte transactions. When the warp later tries to access those, it's possible that they are no longer in L1, and so new global memory transactions must be issued. Also, if the compiler is able to issue wider instructions to read more values in at the same time, for future use, it will be using registers for storing those between the point of the load and the point of use, perhaps increasing the register usage of the kernel.

On the other hand, if the values are packed into a struct of arrays, the load can be serviced with as few transactions as possible. So, when reading from the x array, only x values are loaded in the 128-byte transactions. This could cause fewer transactions, less reliance on the caches and a more even distribution between compute and memory operations.
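A sketch of the two access patterns (hypothetical kernels that only need the x component):

// AoS: only .x is needed, but every 128-byte transaction also drags in
// the unused .y values that are interleaved with the x values.
__global__ void readX_aos(const float2 *pts, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = pts[i].x;
}

// SoA: the x values are contiguous, so every byte of each 128-byte
// transaction is useful.
__global__ void readX_soa(const float *xs, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = xs[i];
}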

Roger Dahl answered Sep 20 '22


I don't believe the built-in tuples in CUDA ([u]int[2|4], float[2|4], double[2]) have any intrinsic advantages; they exist mostly for convenience. You could define your own C++ classes with the same layout and the compiler would operate on them efficiently. The hardware does have native 64-bit and 128-bit loads, so you'd want to check the generated microcode to know for sure.
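One way to do that check (my own illustrative sketch, not part of the answer) is to compile a trivial float4 copy kernel and inspect the generated machine code, e.g. with cuobjdump -sass, for 128-bit loads/stores such as LD.E.128 or LDG.E.128:

// Hypothetical test kernel: copying float4 elements gives the compiler the
// chance to emit a single 128-bit load and store per element.
__global__ void copy128(const float4 *in, float4 *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}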

As for whether you should use an array of uint2 (array of structures, or AoS) or two arrays of uint (structure of arrays, or SoA), there are no easy answers - it depends on the application. For built-in types of convenient size (2x32-bit or 4x32-bit), AoS has the advantage that you only need one pointer to load/store each data element. SoA requires multiple base pointers, or at least multiple offsets and separate load/store operations per element; but it may be faster for workloads that sometimes only operate on a subset of the elements.
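For example, when a kernel needs both members of the tuple, the AoS layout can fetch them with a single 64-bit load per thread, while the SoA layout needs two 32-bit loads from two base pointers (hypothetical sketch):

// AoS: one base pointer, one 64-bit load per element.
__global__ void addPairs_aos(const uint2 *ab, unsigned int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        uint2 v = ab[i];        // single 64-bit load
        out[i] = v.x + v.y;
    }
}

// SoA: two base pointers, two separate 32-bit loads per element.
__global__ void addPairs_soa(const unsigned int *a, const unsigned int *b,
                             unsigned int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = a[i] + b[i];
}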

As an example of a workload that uses AoS to good effect, look at the nbody sample (which uses float4 to hold XYZ+mass of each particle). The Black-Scholes sample uses SoA, presumably because float3 is an inconvenient element size.

ArchaeaSoftware answered Sep 21 '22