CUDA provides built-in vector data types like uint2, uint4 and so on. Are there any advantages to using these data types?
Let's assume that I have a tuple which consists of two values, A and B. One way to store them in memory is to allocate two arrays: the first array stores all the A values and the second array stores all the B values at indexes that correspond to the A values. Another way is to allocate one array of type uint2. Which one should I use? Which way is recommended? Do the members of uint3, i.e. x, y, z, reside side by side in memory?
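For concreteness, here is a minimal sketch of what I mean by the two layouts (the names d_a, d_b, d_ab and the size n are just illustrative):

#include <cuda_runtime.h>

int main()
{
    size_t n = 1 << 20;

    // Option 1: two separate arrays (structure of arrays, SoA)
    unsigned int *d_a, *d_b;   // d_a holds all the A values, d_b all the B values
    cudaMalloc((void **)&d_a, n * sizeof(unsigned int));
    cudaMalloc((void **)&d_b, n * sizeof(unsigned int));

    // Option 2: a single array of uint2 (array of structures, AoS)
    uint2 *d_ab;               // element i holds A_i in .x and B_i in .y
    cudaMalloc((void **)&d_ab, n * sizeof(uint2));

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_ab);
    return 0;
}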
This is going to be a bit speculative but may add to @ArchaeaSoftware's answer.
I'm mainly familiar with Compute Capability 2.0 (Fermi). For this architecture, I don't think that there is any performance advantage to using the vectorized types, except maybe for 8- and 16-bit types.
Looking at the declaration for char4:
struct __device_builtin__ __align__(4) char4
{
signed char x, y, z, w;
};
The type is aligned to 4 bytes. I don't know what __device_builtin__ does. Maybe it triggers some magic in the compiler...
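One way to convince yourself that there is nothing more to it than an aligned struct is to compare a hand-written equivalent against char4. Something like this should compile cleanly (a small check of my own, assuming a C++11-capable toolkit):

#include <cuda_runtime.h>   // pulls in vector_types.h (char4) and __align__

// A hand-written equivalent of char4. __align__(4) is the CUDA alignment
// specifier; it expands to the appropriate host/device compiler attribute.
struct __align__(4) my_char4
{
    signed char x, y, z, w;
};

// Both types occupy 4 bytes and are 4-byte aligned.
static_assert(sizeof(my_char4) == sizeof(char4), "same size as char4");
static_assert(alignof(my_char4) == alignof(char4), "same alignment as char4");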
Things look a bit strange for the declarations of float1, float2, float3 and float4:
struct __device_builtin__ float1
{
float x;
};
__cuda_builtin_vector_align8(float2, float x; float y;);
struct __device_builtin__ float3
{
float x, y, z;
};
struct __device_builtin__ __builtin_align__(16) float4
{
float x, y, z, w;
};
float2 gets some form of special treatment. float3 is a struct without any alignment, and float4 gets aligned to 16 bytes. I'm not sure what to make of that.
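If the declarations above are what actually ends up being used, the sizes and alignments should come out like this (a small check of my own, assuming a recent toolkit with C++11 enabled):

#include <cuda_runtime.h>   // vector_types.h comes in through here

// float1 and float3 carry no extra alignment, so they behave like plain structs.
static_assert(sizeof(float1) == 4  && alignof(float1) == 4,  "float1 is a plain 4-byte struct");
static_assert(sizeof(float3) == 12 && alignof(float3) == 4,  "float3 is three packed floats");

// float2 and float4 are over-aligned so a single 64-bit or 128-bit
// load/store instruction can move a whole element.
static_assert(sizeof(float2) == 8  && alignof(float2) == 8,  "float2 is 8-byte aligned");
static_assert(sizeof(float4) == 16 && alignof(float4) == 16, "float4 is 16-byte aligned");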
Global memory transactions are 128 bytes, aligned to 128 bytes. Transactions are always performed for a full warp at a time. When a warp reaches a function that performs a memory transaction, say a 32-bit load from global memory, the chip will at that time perform as many transactions as are necessary for servicing all the 32 threads in the warp. So, if all the accessed 32-bit values are within a single 128-byte line, only one transaction is necessary. If the values come from different 128-byte lines, multiple 128-byte transactions are performed. For each transaction, the warp is put on hold for around 600 cycles while the data is fetched from memory (unless it's in the L1 or L2 caches).
So, I think the key to finding out what type of approach gives the best performance, is to consider which approach causes the fewest 128-byte memory transactions.
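To illustrate what a fully coalesced access looks like (a simplified sketch of my own, assuming a properly aligned array and block sizes that are a multiple of the warp size):

// Each thread loads one 32-bit value. The 32 threads of a warp touch 128
// consecutive bytes, so (given an aligned allocation) the warp's load is
// serviced by a single 128-byte transaction.
__global__ void copy32(const unsigned int *in, unsigned int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}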
Assuming that the built-in vector types are just structs, some of which have special alignment, using the vector types causes the values to be stored in an interleaved way in memory (array of structs). So, if the warp is loading all the x values at that point, the other values (y, z, w) will be pulled into L1 because of the 128-byte transactions. When the warp later tries to access those, it's possible that they are no longer in L1, and so new global memory transactions must be issued. Also, if the compiler is able to issue wider instructions to read more values in at the same time, for future use, it will be using registers for storing those between the point of the load and the point of use, perhaps increasing the register usage of the kernel.
On the other hand, if the values are packed into a struct of arrays, the load can be serviced with as few transactions as possible. So, when reading from the x array, only x values are loaded in the 128-byte transactions. This could cause fewer transactions, less reliance on the caches and a more even distribution between compute and memory operations.
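Roughly, the two access patterns I have in mind look like this (an illustrative sketch, not benchmarked):

// AoS: one array of uint2. A warp that only needs .x still drags the
// interleaved .y values through the 128-byte transactions.
__global__ void read_x_aos(const uint2 *ab, unsigned int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = ab[i].x;
}

// SoA: separate arrays. A warp reading x touches only x's 128-byte lines.
__global__ void read_x_soa(const unsigned int *a, unsigned int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = a[i];
}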
I don't believe the built-in tuples in CUDA ([u]int[2|4], float[2|4], double[2]) have any intrinsic advantages; they exist mostly for convenience. You could define your own C++ classes with the same layout and the compiler would operate on them efficiently. The hardware does have native 64-bit and 128-bit loads, so you'd want to check the generated microcode to know for sure.
As for whether you should use an array of uint2 (array of structures, or AoS) or two arrays of uint (structure of arrays, or SoA), there are no easy answers - it depends on the application. For built-in types of convenient size (2x32-bit or 4x32-bit), AoS has the advantage that you only need one pointer to load/store each data element. SoA requires multiple base pointers, or at least multiple offsets and separate load/store operations per element; but it may be faster for workloads that sometimes only operate on a subset of the elements.
As an example of a workload that uses AoS to good effect, look at the nbody sample (which uses float4 to hold XYZ+mass of each particle). The Black-Scholes sample uses SoA, presumably because float3 is an inconvenient element size.
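For illustration (loosely modelled on the nbody pattern, not taken from the sample itself, and with hypothetical names), a kernel that loads a whole float4 per thread lets the compiler issue a single 128-bit load per element; you can check this by inspecting the SASS with cuobjdump -sass:

// Each thread loads one 16-byte-aligned float4, so the compiler can emit a
// single 128-bit load per element (it shows up as a .128 load in the SASS).
__global__ void scale_positions(const float4 *bodies, float4 *out, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
    {
        float4 b = bodies[i];   // xyz = position, w = mass (nbody-style layout)
        out[i] = make_float4(b.x * s, b.y * s, b.z * s, b.w);
    }
}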