Array of Structures (AoS) vs Structure of Arrays (SoA) on random reads for vectorization

Tags:

My question is in regard of the following phrase from the book:

Unfortunately, the SoA form is not ideal in all circumstances. For random or incoherent circumstances, gathers are used to access the data and the SoA form can result in extra unneeded data being read into cache, thus reducing performance. In this case, use of the AoS form instead will result in a smaller working set and improved performance. Generally, though, if the computation is to be vectorized, the SoA form is preferred.

My guess on why AoS may result in better performance is when different, or better all, fields in the same structure are participating in the single vectorization run.

Example (just a concept, no concrete, or working code at all):

/*Note that the types of data I maintain the same intentionally, 
  to simplify discussion*/
struct Data {
     float mean; 
     float distribution[10]
}

and define array of those got randomly from some data source

Data aos[5];

now, if during the vectorization loop I do something like:

float* dataPtr =  &(aos[0].mean);

#pragma simd
for(int i=0; i< 60; i++)
{
   const float mean = (*dataPtr);
   /*do something with mean */

   dataPtr++;

   /*do something with distribution */
}

this will result in better performance, cause in case of a SoA, I will push on cache line more information that I may actually require during this computation. Some CPU pre-caching ? That in case of AoS results in better performance instead.

Are my assumption correct, or there is something else ?

808

asked Feb 23 '15 14:02

Tigran

1 Answers

You can parallelize your program in two ways: horizontally and vertically. I think you are mixing those two approaches.

Horizontal parallelization treats each lane in your SIMD unit as a separate "thread" working on a different data. Vertical parallelization takes whole SIMD unit working on the same data object, trying to benefit from its inner multi-dimensionality.

To give a concrete example: consider you have 2 arrays X and Y of 3D vectors that you want to add.

Horizontal approach: each lane of the SIMD unit would do:

for(idx = 0; idx<size; idx+=SIMD_size) {
    ... = X[idx+laneid].x + Y[idx+laneid].x;
    ... = X[idx+laneid].y + Y[idx+laneid].y;
    ... = X[idx+laneid].z + Y[idx+laneid].z;
}

Vertical approach: each lane of the SIMD unit takes a different component of the same vector:

for(idx = 0; idx<size; idx+=1) { 
    ... = X[idx].coord(laneid) + Y[idx].coord(laneid);
}

Vertical approach is easier to implement. In fact, compilers are trying to auto-vectorize already. The problem is that as the width of the SIMD unit is growing, the implementation cannot benefit from it. If you switch from 4-wide to 16-wide SIMD, you are still adding up only 3 numbers in parallel of your 3D vector.

Horizontal approach is harder. You usually have to handle diverging branches, function calls, etc... and - you want to reorganize your data into Structure-of-Arrays - so that the corresponding fields of your different data object are next to each other in memory.

Now, back to your question: SoA makes sense only if you do horizontal parallelization. When each lane is access the same field of different object, SoA allows to replace an expensive gather instruction with a better aligned single memory fetch. If you try to do vertical, as in your example in the question - no one would even consider doing SoA in the first place - accessing multiple fields of same object would cause "gather".

However, with random access, SoA may not be the best option even if you do horizontal parallelization. First, you get no benefit of having SoA because you still need to do the expensive gather. However, as your fields of the same object are spread across the memory, each load is going to hit a different cache lane. Not only it increases the usage of memory bandwidth, it may also cause cache thrashing. This is why SoA are not that efficient with random access.

A better solution is to have a hybrid approach: You pack your data in an Array-of-Structures-of-Arrays-of-SIMD-with-size. But that is another story...

196

answered Sep 19 '22 14:09

CygnusX1

Related questions
                            
                                Can I use std::bind to "attach" a state to a function?
                            
                                bit field declaration in a c++ struct
                            
                                How to handle different header file location on different Linux distributions?
                            
                                Undefined symbol: _ZN7QString13toUtf8_helperERKS_ at runtime
                            
                                Find index of first match using C++ regex
                            
                                Constructing std::function target in-place
                            
                                Why do some people put a blank #include at the begining of C++ Programs? [closed]
                            
                                constexpr of static tuple class member has linker error
                            
                                Serializing mutable state and sending it asynchronously over the network with nearly-zero-copy (Cap'n Proto + ZeroMQ)
                            
                                Abbreviated type name long long vs long long int, is it standard-compliant?
                            
                                Implicitly calling copy constructor
                            
                                Ubuntu CMake what path to add to CMAKE_MODULE_PATH
                            
                                Qt for iOS: code signing is required
                            
                                boost::optional and type conversion
                            
                                Observer Pattern Via Boost Signal2
                            
                                How to quickly create arrays in C++
                            
                                Perfect forwarding and std::tuple
                            
                                Why does unique_ptr<T>::~unique_ptr need the definition of T?
                            
                                proper use of std move for a factory class
                            
                                Rcpp Copy a vector to a vector

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Array of Structures (AoS) vs Structure of Arrays (SoA) on random reads for vectorization

Tags:

c++

vectorization

parallel-processing

cpu-cache

Tigran

People also ask

1 Answers

CygnusX1

Recent Activity

Donate For Us