I've recently been reading about AoS vs SoA structure design and data-oriented design. It's oddly difficult to find information about either, and what I have found seems to assume greater understanding of processor functionality than I possess. That said, what I do understand about the former topic in particular leads to some questions that I think I should be able to understand the answers to.
Firstly, to make sure I am not basing my understanding on a false premise, here is my understanding of the functionality and pros and cons of AoS vs SoA, as applied to a collection of 'Person' records with 'Name' and 'Age' fields associated with them:
SoA: a People object with fields Names (an array of strings) and Ages (an array of integers). The name and age of the third person are People.Names[2] and People.Ages[2].

AoS: a People array of Person objects, each of which has Name as a string field and Age as an integer field. The name and age of the third person are People[2].Name and People[2].Age.

The AoS layout's more obvious Person structure makes writing code in most programming languages much more straightforward.

The long and short of it seems to be that, assuming for the sake of argument that your bottleneck for performance is data access and ease of coding is irrelevant: if you almost exclusively need to access a single field at a time across a large amount of data, SoA is likely to be more performant, while if you often need to access multiple fields from the same object, or deal with single objects rather than many at once, AoS will be more performant.
That said, some of what I've been reading seems to muddy the picture. Firstly, multiple sources have stated that SoA requires indexed addressing which is claimed to be inefficient. I cannot make sense of this, and have been unable to find any explanations. It seems to me that AoS and SoA require exactly the same operations to access any particular piece of data, though in differing orders, except that SoA requires an additional pointer (possibly more than one, depending on the kind of structure used). Oversimplifying a bit, to get the age of the fifth person in my above example under AoS, you would first get the pointer to the array, add 4 to it, get the structure pointer at that element of the array, add the size of a string pointer to it since the age is the second field, then access the integer at that pointer. Under SoA, you would get the pointer to the structure and add the size of a string array pointer to it to get to the list of ages, then get the pointer to the list of integers stored there and add 4 to it, then get the integer stored there.
Secondly, it's not clear to me the degree to which the benefits from SoA are reliant on particular CPU architectures. On the one hand, what I understand of the benefits as described above does not rely on any particular architecture except that SIMD instructions can provide additional benefits not available under AoS in some cases. On the other, I've seen claims that SoA's benefits can be limited depending on the number of lanes available in a particular SIMD architecture. Again, that would seem to affect only the additional benefit that the SIMD instructions can provide over the more general cache benefit.
Finally, I've seen the claim that SoA can require more cache ways when traversing data. I'm not completely sure what cache ways are or what, if anything, specifically is meant by 'traversing' data. My best guess is that 'cache ways' either refers to or correlates with the number of potential collisions in an associative cache, and that it relates to the second Con I mentioned above.
"traversing" just means looping over the data.
And yes, you're right about cache ways and collisions. 64 B (cache-line-sized) blocks of memory that are offset from each other by a large power of 2 map to the same set, and thus compete with each other for ways in that set, instead of being cached in different sets. (e.g. Intel's L1 data caches are 32 kiB, 8-way associative, with 64 B lines. There are 32 kiB / 64 B/line = 512 lines, grouped into 512 lines / 8 ways/set = 64 sets.)

Loading 9 items offset from each other by 4 kiB (64 B/line * 64 sets, not coincidentally the page size) will evict the first one.
L2 and L3 caches are more highly associative, like 16 or 24 way, but still susceptible to "aliasing" like this, just like a hash table, where there's lots of demand for some sets (buckets) and no demand for other sets (buckets). For CPU caches, the "hash function" is almost always to use some of the address bits as an index, and to ignore the other bits. (The high bits of an address are used as the tag, to determine if any way in the set is actually caching the requested block, and the low bits are used to select bytes within the cache line.)
I think the SoA benefit is mostly from SIMD (auto-vectorization or manual), but also if you tend to loop over your data looking at only one or two fields from most structs, and only accessing the rest in rare cases where you find an interesting one based on one member.
A hybrid approach with separate arrays for each thing (or group of things) you look at together could make sense, with the rest of the data for each object in an array of structs. I'm imagining a linear search loop where most objects are rejected based on looking at one int field, but for the few objects that pass that test, you look at all the fields.
Grouping together the fields that are mostly accessed together gives you the benefit of spatial locality for these accesses, while still letting search loops that check the key field loop over contiguous memory (rather than with a big stride).
I'm currently experimenting with a layout that interleaves in SIMD vector sized groups. Most of the code that traverses the data needs all fields from every object, and doing it this way means the loop only needs one pointer, and all the memory is allocated as a single block.
This is for collision-detection masks (in a 2D space game (Endless Sky), where it's all collision between a line segment and a ship outline (traced automatically from the sprite), not between two polygons). Here's the original, which looped over a vector of double x,y pairs (and used some (non-inline!) functions to operate on them as a 16B SIMD vector, often with slow SSE3 horizontal-add instructions and stuff like that :( ).
SSE2/SSE3 on XY pairs is probably better than nothing if you can't change the data layout, but changing the layout removes all the shuffling for doing 4 cross products in parallel. See the slides from this SIMD (SSE) intro at Insomniac Games (GDC 2015). It starts off with very basic stuff for people who haven't done anything with SIMD before, and explains exactly how structs of arrays are helpful. By the end, it gets to intermediate/advanced SSE techniques, so it's worth flipping through even if you already know some SIMD stuff. See also the sse tag wiki for some other links.
Anyway, this is the interleave data structure I came up with:
```cpp
class Mask {
    ...
    struct xy_interleave {
        static constexpr unsigned vecSize = 4;
        static constexpr unsigned alignMask = vecSize-1;
        alignas(64) float x[vecSize];
        float y[vecSize];
        // TODO: reduce cache footprint by calculating this on the fly, maybe with an unaligned load?
        float dx[vecSize];   // next - current;  next.x = x+dx
        float dy[vecSize];
    };

    std::vector<xy_interleave> outline_simd;
};
```
Then I can loop over it with stuff like (real code here: this is my work-in-progress not-cleaned-up code that's not ready to be sent upstream)
```cpp
__m128 minus_point_ps = _mm_cvtpd_ps(-point);   // + is commutative, which helps the compiler with AVX
const __m128 minus_px = _mm_set1_ps(minus_point_ps[0]);
const __m128 minus_py = _mm_set1_ps(minus_point_ps[1]);
const __m128 range2 = _mm_set1_ps(float(range*range));

for(const xy_interleave &curr : outline_simd) {
    // load the x and y fields from this group of 4 objects, all of which come from the same cache line
    __m128 dx = _mm_load_ps(curr.x) + minus_px;
    __m128 dy = _mm_load_ps(curr.y) + minus_py;
    // this is using GNU Vector Extensions for + and *, instead of _mm_add_ps and _mm_mul_ps,
    // since GNU C++ defines __m128 in terms of __v4sf
    __m128 cmp = _mm_cmplt_ps(dx*dx - range2, dy*dy);   // transform the inequality for more ILP
    if(_mm_movemask_ps(cmp))
        return true;
}
```
This compiles to really nice-looking asm loops, with only one pointer looping over the std::vector, and vector loads from constant offsets relative to that loop pointer.
However, scalar fallback loops over the same data are less pretty. (And actually I use loops like this (with j+=4) in the manually-vectorized parts, too, so I can change the interleave without breaking the code. It compiles away entirely, or turns into an unroll.)
```cpp
// TODO: write an iterator or something to make this suck less
for(const xy_interleave &curr : outline_simd)
    for (unsigned j = 0; j < curr.vecSize; ++j) {
        float dx = curr.x[j] - px;
        float dy = curr.y[j] - py;
        if(dx*dx + dy*dy < range2)
            return true;
    }
```
Unfortunately I've had no luck getting gcc or clang to auto-vectorize this, even for easy cases with no conditionals (e.g. just finding the min range from a query x,y to any point in the collision-mask, instead of checking if a point is within range).
I might discard this idea and go with separate x and y arrays. (Maybe packed head to tail in the same std::vector<float> (with an aligned allocator) to keep it part of one allocation, but that would still mean that loops would need separate x and y pointers, because the offset between the x and y for a given vertex would be a runtime variable, not a compile-time constant.)
Having all the xs contiguous would be a big help if I want to stop storing x[i+1]-x[i] and compute it on the fly. With my layout, I would need to shuffle between vectors, instead of just doing an unaligned load offset by 1 float.
It would hopefully also allow the compiler to auto-vectorize some of the functions (e.g. for ARM, or for AVX/AVX2 with wider vectors).
Of course, manual vectorization is going to win here, since I'm doing stuff like XORing floats together because I only care about their sign bit as a truth value, instead of doing a compare and then XORing the compare result. (My testing so far has shown that treating negative 0 as negative still gives correct results for Mask::Intersect, but any way to express it in C is going to follow the IEEE rules, where x >= 0 is true for x = -0.0.)
"if you almost exclusively need to access a single field at a time on a large amount of data AoS is likely to be more performant while if you often need to access multiple fields from the same object or deal with single objects rather than many at once, SoA will be more performant."

You have this exactly backwards. Was this a typo? Grouping all the foo[i].key fields into a foo.key[i] array means they're all packed together in the cache, so accessing just that one field in many objects means you're using all 64 bytes of every cache line that you touch.
You got it correct earlier when you wrote:

"When working with only some of the data from many 'Person' records, only that data needs to be loaded into memory."

(except I think you mean "from" memory (into cache), unless you're talking about a memory-mapped file and faulting pages from disk into memory.)
Indexed addressing modes:
In a situation where you're looking at two or three fields in each object, an SoA layout is going to tie up more registers holding separate base addresses for each separate array you're looping over.
With multiple pointers, you're going to want to either use addressing modes like [reg1 + 4*reg2] on x86, or you're going to need to separately increment a bunch of different pointers inside your loop. Indexed addressing modes are potentially slightly slower on Intel SnB-family, because they can't stay micro-fused with ALU uops in the out-of-order core (only in the decoders and uop cache). Skylake can keep them micro-fused, but further testing is needed to find out when Intel made this change. Perhaps with Broadwell, when three-input instructions beyond FMA (like CMOV and ADC) decoded to a single uop, but that's a pure guess. Testing on Haswell and Broadwell is needed.