Structure of Arrays vs Array of Structures

Tags:

From some comments that I have read in here, for some reason it is preferable to have Structure of Arrays (SoA) over Array of Structures (AoS) for parallel implementations like CUDA? If that is true, can anyone explain why? Thanks in advance!

754

asked Jul 29 '13 12:07

BugShotGG

2 Answers

Choice of AoS versus SoA for optimum performance usually depends on access pattern. This is not just limited to CUDA however - similar considerations apply for any architecture where performance can be significantly affected by memory access pattern, e.g. where you have caches or where performance is better with contiguous memory access (e.g. coalesced memory accesses in CUDA).

E.g. for RGB pixels versus separate RGB planes:

struct {     uint8_t r, g, b; } AoS[N];  struct {     uint8_t r[N];     uint8_t g[N];     uint8_t b[N]; } SoA;

If you are going to be accessing the R/G/B components of each pixel concurrently then AoS usually makes sense, since the successive reads of R, G, B components will be contiguous and usually contained within the same cache line. For CUDA this also means memory read/write coalescing.

However if you are going to process color planes separately then SoA might be preferred, e.g. if you want to scale all R values by some scale factor, then SoA means that all R components will be contiguous.

One further consideration is padding/alignment. For the RGB example above each element in an AoS layout is aligned to a multiple of 3 bytes, which may not be convenient for CUDA, SIMD, et al - in some cases perhaps even requiring padding within the struct to make alignment more convenient (e.g. add a dummy uint8_t element to ensure 4 byte alignment). In the SoA case however the planes are byte aligned which can be more convenient for certain algorithms/architectures.

For most image processing type applications the AoS scenario is much more common, but for other applications, or for specific image processing tasks this may not always be the case. When there is no obvious choice I would recommend AoS as the default choice.

See also this answer for more general discussion of AoS v SoA.

173

answered Oct 07 '22 18:10

Paul R

I just want to provide a simple example showing how a Struct of Arrays (SoA) performs better than an Array of Structs (AoS).

In the example, I'm considering three different versions of the same code:

SoA (v1)
Straight arrays (v2)
AoS (v3)

In particular, version 2 considers the use of straight arrays. The timings of versions 2 and 3 are the same for this example and result to be better than version 1. I suspect that, in general, straight arrays could be preferable, although at the expense of readability, since, for example, loading from uniform cache could be enabled through const __restrict__ for this case.

#include "cuda_runtime.h" #include "device_launch_parameters.h"  #include <stdio.h>  #include <thrust\device_vector.h>  #include "Utilities.cuh" #include "TimingGPU.cuh"  #define BLOCKSIZE   1024  /******************************************/ /* CELL STRUCT LEADING TO ARRAY OF STRUCT */ /******************************************/ struct cellAoS {      unsigned int    x1;     unsigned int    x2;     unsigned int    code;     bool            done;  };  /*******************************************/ /* CELL STRUCT LEADING TO STRUCT OF ARRAYS */ /*******************************************/ struct cellSoA {      unsigned int    *x1;     unsigned int    *x2;     unsigned int    *code;     bool            *done;  };   /*******************************************/ /* KERNEL MANIPULATING THE ARRAY OF STRUCT */ /*******************************************/ __global__ void AoSvsSoA_v1(cellAoS *d_cells, const int N) {      const int tid = threadIdx.x + blockIdx.x * blockDim.x;      if (tid < N) {         cellAoS tempCell = d_cells[tid];          tempCell.x1 = tempCell.x1 + 10;         tempCell.x2 = tempCell.x2 + 10;          d_cells[tid] = tempCell;     }  }  /******************************/ /* KERNEL MANIPULATING ARRAYS */ /******************************/ __global__ void AoSvsSoA_v2(unsigned int * __restrict__ d_x1, unsigned int * __restrict__ d_x2, const int N) {      const int tid = threadIdx.x + blockIdx.x * blockDim.x;      if (tid < N) {          d_x1[tid] = d_x1[tid] + 10;         d_x2[tid] = d_x2[tid] + 10;      }  }  /********************************************/ /* KERNEL MANIPULATING THE STRUCT OF ARRAYS */ /********************************************/ __global__ void AoSvsSoA_v3(cellSoA cell, const int N) {      const int tid = threadIdx.x + blockIdx.x * blockDim.x;      if (tid < N) {          cell.x1[tid] = cell.x1[tid] + 10;         cell.x2[tid] = cell.x2[tid] + 10;      }  }  /********/ /* MAIN */ /********/ int main() {      const int N = 2048 * 2048 * 4;      TimingGPU timerGPU;      thrust::host_vector<cellAoS>    h_cells(N);     thrust::device_vector<cellAoS>  d_cells(N);      thrust::host_vector<unsigned int>   h_x1(N);     thrust::host_vector<unsigned int>   h_x2(N);      thrust::device_vector<unsigned int> d_x1(N);     thrust::device_vector<unsigned int> d_x2(N);      for (int k = 0; k < N; k++) {          h_cells[k].x1 = k + 1;         h_cells[k].x2 = k + 2;         h_cells[k].code = k + 3;         h_cells[k].done = true;          h_x1[k] = k + 1;         h_x2[k] = k + 2;      }      d_cells = h_cells;      d_x1 = h_x1;     d_x2 = h_x2;      cellSoA cell;     cell.x1 = thrust::raw_pointer_cast(d_x1.data());     cell.x2 = thrust::raw_pointer_cast(d_x2.data());     cell.code = NULL;     cell.done = NULL;      timerGPU.StartCounter();     AoSvsSoA_v1 << <iDivUp(N, BLOCKSIZE), BLOCKSIZE >> >(thrust::raw_pointer_cast(d_cells.data()), N);     gpuErrchk(cudaPeekAtLastError());     gpuErrchk(cudaDeviceSynchronize());     printf("Timing AoSvsSoA_v1 = %f\n", timerGPU.GetCounter());      //timerGPU.StartCounter();     //AoSvsSoA_v2 << <iDivUp(N, BLOCKSIZE), BLOCKSIZE >> >(thrust::raw_pointer_cast(d_x1.data()), thrust::raw_pointer_cast(d_x2.data()), N);     //gpuErrchk(cudaPeekAtLastError());     //gpuErrchk(cudaDeviceSynchronize());     //printf("Timing AoSvsSoA_v2 = %f\n", timerGPU.GetCounter());      timerGPU.StartCounter();     AoSvsSoA_v3 << <iDivUp(N, BLOCKSIZE), BLOCKSIZE >> >(cell, N);     gpuErrchk(cudaPeekAtLastError());     gpuErrchk(cudaDeviceSynchronize());     printf("Timing AoSvsSoA_v3 = %f\n", timerGPU.GetCounter());      h_cells = d_cells;      h_x1 = d_x1;     h_x2 = d_x2;      // --- Check results     for (int k = 0; k < N; k++) {         if (h_x1[k] != k + 11) {             printf("h_x1[%i] not equal to %i\n", h_x1[k], k + 11);             break;         }         if (h_x2[k] != k + 12) {             printf("h_x2[%i] not equal to %i\n", h_x2[k], k + 12);             break;         }         if (h_cells[k].x1 != k + 11) {             printf("h_cells[%i].x1 not equal to %i\n", h_cells[k].x1, k + 11);             break;         }         if (h_cells[k].x2 != k + 12) {             printf("h_cells[%i].x2 not equal to %i\n", h_cells[k].x2, k + 12);             break;         }     }  }

The following are the timings (runs performed on a GTX960):

Array of struct        9.1ms (v1 kernel) Struct of arrays       3.3ms (v3 kernel) Straight arrays        3.2ms (v2 kernel)

answered Oct 07 '22 19:10

Vitality

Related questions
                            
                                In clang-format, what do the penalties do?
                            
                                How can I declare a member vector of the same class?
                            
                                Non-intersecting line segments while minimizing the cumulative length
                            
                                Why are are std::allocator's construct and destroy functions deprecated in c++17?
                            
                                Access to protected member through member-pointer: is it a hack?
                            
                                When to use the Visual Studio Additional dependencies?
                            
                                Why can const char* const & = "hello" compile?
                            
                                Incomplete type in nested name specifier
                            
                                What does the standard say about how calling clear on a vector changes the capacity?
                            
                                Is it possible to get a pointer to one subobject via a pointer to a different, unreleated subobject?
                            
                                Should I use shared_ptr or unique_ptr
                            
                                Why does a C++ friend class need a forward declaration only in other namespaces?
                            
                                Is std::memcpy between different trivially copyable types undefined behavior?
                            
                                what is the performance impact of using int64_t instead of int32_t on 32-bit systems?
                            
                                constexpr if and static_assert
                            
                                QObject: Cannot create children for a parent that is in a different thread
                            
                                Use of string_view for map lookup
                            
                                How to safely read a line from an std::istream?
                            
                                Default argument in the middle of parameter list?
                            
                                Undefined Symbols for architecture x86_64: Compiling problems

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Structure of Arrays vs Array of Structures

Tags:

c++

arrays

struct

cuda

BugShotGG

People also ask

2 Answers

Paul R

Vitality

Recent Activity

Donate For Us