Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Structure of Arrays vs Array of Structures

From some comments that I have read in here, for some reason it is preferable to have Structure of Arrays (SoA) over Array of Structures (AoS) for parallel implementations like CUDA? If that is true, can anyone explain why? Thanks in advance!

like image 754
BugShotGG Avatar asked Jul 29 '13 12:07

BugShotGG


People also ask

What is the difference between array of structure and structure of array?

Structure can be defined as a data structure used as container which can hold variables of different types. On other hand Array is a type of data structure used as container which can hold variables of same type and do not support multiple data type variables.

What is meant by structure of array?

An array of structures is simply an array in which each element is a structure of the same type. The referencing and subscripting of these arrays (also called structure arrays) follow the same rules as simple arrays.

Whats is the main difference between array & structure?

Array refers to a collection consisting of elements of homogeneous data type. Structure refers to a collection consisting of elements of heterogeneous data type. Array is pointer as it points to the first element of the collection.

What are array of structures give example?

The array of structures in C are used to store information about multiple entities of different data types. The array of structures is also known as the collection of structures. Let's see an example of an array of structures that stores information of 5 students and prints it.


2 Answers

Choice of AoS versus SoA for optimum performance usually depends on access pattern. This is not just limited to CUDA however - similar considerations apply for any architecture where performance can be significantly affected by memory access pattern, e.g. where you have caches or where performance is better with contiguous memory access (e.g. coalesced memory accesses in CUDA).

E.g. for RGB pixels versus separate RGB planes:

struct {     uint8_t r, g, b; } AoS[N];  struct {     uint8_t r[N];     uint8_t g[N];     uint8_t b[N]; } SoA; 

If you are going to be accessing the R/G/B components of each pixel concurrently then AoS usually makes sense, since the successive reads of R, G, B components will be contiguous and usually contained within the same cache line. For CUDA this also means memory read/write coalescing.

However if you are going to process color planes separately then SoA might be preferred, e.g. if you want to scale all R values by some scale factor, then SoA means that all R components will be contiguous.

One further consideration is padding/alignment. For the RGB example above each element in an AoS layout is aligned to a multiple of 3 bytes, which may not be convenient for CUDA, SIMD, et al - in some cases perhaps even requiring padding within the struct to make alignment more convenient (e.g. add a dummy uint8_t element to ensure 4 byte alignment). In the SoA case however the planes are byte aligned which can be more convenient for certain algorithms/architectures.

For most image processing type applications the AoS scenario is much more common, but for other applications, or for specific image processing tasks this may not always be the case. When there is no obvious choice I would recommend AoS as the default choice.

See also this answer for more general discussion of AoS v SoA.

like image 173
Paul R Avatar answered Oct 07 '22 18:10

Paul R


I just want to provide a simple example showing how a Struct of Arrays (SoA) performs better than an Array of Structs (AoS).

In the example, I'm considering three different versions of the same code:

  1. SoA (v1)
  2. Straight arrays (v2)
  3. AoS (v3)

In particular, version 2 considers the use of straight arrays. The timings of versions 2 and 3 are the same for this example and result to be better than version 1. I suspect that, in general, straight arrays could be preferable, although at the expense of readability, since, for example, loading from uniform cache could be enabled through const __restrict__ for this case.

#include "cuda_runtime.h" #include "device_launch_parameters.h"  #include <stdio.h>  #include <thrust\device_vector.h>  #include "Utilities.cuh" #include "TimingGPU.cuh"  #define BLOCKSIZE   1024  /******************************************/ /* CELL STRUCT LEADING TO ARRAY OF STRUCT */ /******************************************/ struct cellAoS {      unsigned int    x1;     unsigned int    x2;     unsigned int    code;     bool            done;  };  /*******************************************/ /* CELL STRUCT LEADING TO STRUCT OF ARRAYS */ /*******************************************/ struct cellSoA {      unsigned int    *x1;     unsigned int    *x2;     unsigned int    *code;     bool            *done;  };   /*******************************************/ /* KERNEL MANIPULATING THE ARRAY OF STRUCT */ /*******************************************/ __global__ void AoSvsSoA_v1(cellAoS *d_cells, const int N) {      const int tid = threadIdx.x + blockIdx.x * blockDim.x;      if (tid < N) {         cellAoS tempCell = d_cells[tid];          tempCell.x1 = tempCell.x1 + 10;         tempCell.x2 = tempCell.x2 + 10;          d_cells[tid] = tempCell;     }  }  /******************************/ /* KERNEL MANIPULATING ARRAYS */ /******************************/ __global__ void AoSvsSoA_v2(unsigned int * __restrict__ d_x1, unsigned int * __restrict__ d_x2, const int N) {      const int tid = threadIdx.x + blockIdx.x * blockDim.x;      if (tid < N) {          d_x1[tid] = d_x1[tid] + 10;         d_x2[tid] = d_x2[tid] + 10;      }  }  /********************************************/ /* KERNEL MANIPULATING THE STRUCT OF ARRAYS */ /********************************************/ __global__ void AoSvsSoA_v3(cellSoA cell, const int N) {      const int tid = threadIdx.x + blockIdx.x * blockDim.x;      if (tid < N) {          cell.x1[tid] = cell.x1[tid] + 10;         cell.x2[tid] = cell.x2[tid] + 10;      }  }  /********/ /* MAIN */ /********/ int main() {      const int N = 2048 * 2048 * 4;      TimingGPU timerGPU;      thrust::host_vector<cellAoS>    h_cells(N);     thrust::device_vector<cellAoS>  d_cells(N);      thrust::host_vector<unsigned int>   h_x1(N);     thrust::host_vector<unsigned int>   h_x2(N);      thrust::device_vector<unsigned int> d_x1(N);     thrust::device_vector<unsigned int> d_x2(N);      for (int k = 0; k < N; k++) {          h_cells[k].x1 = k + 1;         h_cells[k].x2 = k + 2;         h_cells[k].code = k + 3;         h_cells[k].done = true;          h_x1[k] = k + 1;         h_x2[k] = k + 2;      }      d_cells = h_cells;      d_x1 = h_x1;     d_x2 = h_x2;      cellSoA cell;     cell.x1 = thrust::raw_pointer_cast(d_x1.data());     cell.x2 = thrust::raw_pointer_cast(d_x2.data());     cell.code = NULL;     cell.done = NULL;      timerGPU.StartCounter();     AoSvsSoA_v1 << <iDivUp(N, BLOCKSIZE), BLOCKSIZE >> >(thrust::raw_pointer_cast(d_cells.data()), N);     gpuErrchk(cudaPeekAtLastError());     gpuErrchk(cudaDeviceSynchronize());     printf("Timing AoSvsSoA_v1 = %f\n", timerGPU.GetCounter());      //timerGPU.StartCounter();     //AoSvsSoA_v2 << <iDivUp(N, BLOCKSIZE), BLOCKSIZE >> >(thrust::raw_pointer_cast(d_x1.data()), thrust::raw_pointer_cast(d_x2.data()), N);     //gpuErrchk(cudaPeekAtLastError());     //gpuErrchk(cudaDeviceSynchronize());     //printf("Timing AoSvsSoA_v2 = %f\n", timerGPU.GetCounter());      timerGPU.StartCounter();     AoSvsSoA_v3 << <iDivUp(N, BLOCKSIZE), BLOCKSIZE >> >(cell, N);     gpuErrchk(cudaPeekAtLastError());     gpuErrchk(cudaDeviceSynchronize());     printf("Timing AoSvsSoA_v3 = %f\n", timerGPU.GetCounter());      h_cells = d_cells;      h_x1 = d_x1;     h_x2 = d_x2;      // --- Check results     for (int k = 0; k < N; k++) {         if (h_x1[k] != k + 11) {             printf("h_x1[%i] not equal to %i\n", h_x1[k], k + 11);             break;         }         if (h_x2[k] != k + 12) {             printf("h_x2[%i] not equal to %i\n", h_x2[k], k + 12);             break;         }         if (h_cells[k].x1 != k + 11) {             printf("h_cells[%i].x1 not equal to %i\n", h_cells[k].x1, k + 11);             break;         }         if (h_cells[k].x2 != k + 12) {             printf("h_cells[%i].x2 not equal to %i\n", h_cells[k].x2, k + 12);             break;         }     }  } 

The following are the timings (runs performed on a GTX960):

Array of struct        9.1ms (v1 kernel) Struct of arrays       3.3ms (v3 kernel) Straight arrays        3.2ms (v2 kernel) 
like image 40
Vitality Avatar answered Oct 07 '22 19:10

Vitality