My question is about the following passage from the book:
Unfortunately, the SoA form is not ideal in all circumstances. For random or incoherent circumstances, gathers are used to access the data and the SoA form can result in extra unneeded data being read into cache, thus reducing performance. In this case, use of the AoS form instead will result in a smaller working set and improved performance. Generally, though, if the computation is to be vectorized, the SoA form is preferred.
My guess as to why AoS may result in better performance is when different fields, or ideally all of them, in the same structure participate in a single vectorization run.
Example (just a concept, not concrete or working code at all):

/* Note that I intentionally keep the data types the same,
   to simplify the discussion */
struct Data {
    float mean;
    float distribution[10];
};
and define an array of those, filled randomly from some data source:

Data aos[5];
now, if during the vectorization loop I do something like:

float* dataPtr = &(aos[0].mean);
#pragma simd
for (int i = 0; i < 55; i++)  /* 5 structs * 11 floats each */
{
    const float mean = *dataPtr;
    /* do something with mean */
    dataPtr++;
    /* do something with distribution */
}
this would result in better performance, because in the SoA case I would pull into the cache lines more data than this computation actually requires. Some CPU prefetching? Which, in the AoS case, would work in my favor instead.
Are my assumptions correct, or is there something else to it?
You can parallelize your program in two ways: horizontally and vertically. I think you are mixing those two approaches.
Horizontal parallelization treats each lane of your SIMD unit as a separate "thread" working on different data. Vertical parallelization has the whole SIMD unit work on a single data object, trying to benefit from its inner multi-dimensionality.
To give a concrete example: suppose you have two arrays X and Y of 3D vectors that you want to add.
Horizontal approach: each lane of the SIMD unit would do:
for (idx = 0; idx < size; idx += SIMD_size) {
    ... = X[idx+laneid].x + Y[idx+laneid].x;
    ... = X[idx+laneid].y + Y[idx+laneid].y;
    ... = X[idx+laneid].z + Y[idx+laneid].z;
}
Vertical approach: each lane of the SIMD unit takes a different component of the same vector:
for (idx = 0; idx < size; idx += 1) {
    ... = X[idx].coord(laneid) + Y[idx].coord(laneid);
}
The vertical approach is easier to implement; in fact, compilers already try to auto-vectorize this way. The problem is that as the width of the SIMD unit grows, the implementation cannot benefit from it: if you switch from 4-wide to 16-wide SIMD, you are still only adding the 3 components of your 3D vector in parallel.
The horizontal approach is harder. You usually have to handle diverging branches, function calls, etc., and you want to reorganize your data into Structure-of-Arrays, so that the corresponding fields of your different data objects lie next to each other in memory.
Now, back to your question: SoA makes sense only if you do horizontal parallelization. When each lane accesses the same field of a different object, SoA lets you replace an expensive gather instruction with a single, better-aligned memory fetch. If you try to go vertical, as in the example in your question, no one would even consider SoA in the first place: accessing multiple fields of the same object would cause a gather.
However, with random access, SoA may not be the best option even if you do horizontal parallelization. First, you get no benefit from SoA because you still need the expensive gather. Worse, since the fields of the same object are spread across memory, each load hits a different cache line. Not only does this increase memory bandwidth usage, it may also cause cache thrashing. This is why SoA is not that efficient with random access.
A better solution is a hybrid approach: pack your data as an Array-of-Structures-of-Arrays, where each inner array has the SIMD width (AoSoA). But that is another story...