I wonder how I should store N vectors of small dimension (say 3: X, Y, Z) for efficiency.
For cache locality reasons, I was expecting that packing the vectors one after the other, [N][3] (row major), would yield better results than the [3][N] layout (in which all X components are laid out first, then all Y, then all Z) when doing vector operations using OpenMP.
However, when multiplying each vector by a 3x3 matrix using Intel MKL BLAS, I found that the [3][N] layout is twice as fast.
I guess the cache locality advantage is counter-balanced by the fact that SSE instructions work best on contiguous (non-strided) data. This led me to wonder how people (in computer graphics, for instance) store their vectors, and whether there are other pros and cons.
There are two general kinds of data layout used: array of structures (AoS) and structure of arrays (SoA).
AoS:
struct
{
    float x;
    float y;
    float z;
} points[N];
SoA:
struct
{
    float x[N];
    float y[N];
    float z[N];
} points;
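To make the difference concrete, here is a small layout sketch (the struct names and the value of N are illustrative, not from the question): in AoS, successive x components sit sizeof(struct) = 12 bytes apart, while in SoA they are adjacent in memory.

```c
#include <stddef.h>

#define N 8  /* illustrative size */

/* AoS: each point is a 12-byte struct, so successive x components
 * are 12 bytes apart in memory (a strided access pattern). */
struct point_aos { float x, y, z; };

/* SoA: all x components are adjacent, then all y, then all z, so a
 * vector load can grab 4 (or 8) consecutive x values directly. */
struct points_soa8
{
    float x[N];
    float y[N];
    float z[N];
};
```

On common ABIs the AoS struct has no internal padding, so the x-to-x stride really is 12 bytes, which no plain SSE load can span.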
In order to multiply each point by a 3x3 matrix M in the AoS case, the body of the loop looks like:
r[i].x = M[0][0]*points[i].x +
         M[0][1]*points[i].y +
         M[0][2]*points[i].z;
// ditto for r[i].y and r[i].z
SSE can multiply 4 floats at once (AVX can do 8), and it also provides a dot-product operation, but the problem is that loading 3 floats into a vector register is a very inefficient operation. One can add an additional float field to pad the structure, but 1/4 of the compute power is still lost, since the fourth float in both vector operands goes unused (or contains no useful information). Nor can you vectorise over the points, e.g. process 4 points at once, since loading points[i].x through points[i+3].x in one go requires a gather load, which is not (yet) supported on x86 (though this will change once AVX2-capable CPUs become available).
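As a sketch of the padding approach just described (the struct and function names are mine, not from any particular library): one unused float brings the struct to 16 bytes, so each point fills exactly one SSE register, but a quarter of every vector operation is still wasted.

```c
#include <stddef.h>

/* Hypothetical padded AoS layout: one unused float brings the
 * struct to 16 bytes, so each point occupies exactly one SSE
 * register and loads stay aligned. 1/4 of each vector op is
 * still wasted on the pad lane. */
struct point4
{
    float x;
    float y;
    float z;
    float pad;   /* unused, keeps sizeof == 16 */
};

/* One point times a 3x3 matrix, AoS style: three dot-product-like
 * expressions, none of which map cleanly onto 4-wide SSE lanes. */
static struct point4 transform_aos(const float M[3][3], struct point4 p)
{
    struct point4 r;
    r.x = M[0][0]*p.x + M[0][1]*p.y + M[0][2]*p.z;
    r.y = M[1][0]*p.x + M[1][1]*p.y + M[1][2]*p.z;
    r.z = M[2][0]*p.x + M[2][1]*p.y + M[2][2]*p.z;
    r.pad = 0.0f;
    return r;
}
```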
In the SoA case, the inner loop is:
r.x[i] = M[0][0]*points.x[i] +
         M[0][1]*points.y[i] +
         M[0][2]*points.z[i];
// ditto for r.y[i] and r.z[i]
It looks basically the same, but there is a crucial difference: the compiler can now use vector instructions and process 4 points (or even 8 points with AVX) at once. For example, it may unroll the loop and transform it into the following vector equivalent:
<r.x[i], r.x[i+1], r.x[i+2], r.x[i+3]> =
    M[0][0]*<x[i], x[i+1], x[i+2], x[i+3]> +
    M[0][1]*<y[i], y[i+1], y[i+2], y[i+3]> +
    M[0][2]*<z[i], z[i+1], z[i+2], z[i+3]>
There are three vector loads, three scalar-vector multiplications, three vector additions, and one vector store here, all of which utilise 100% of the vector capabilities of SSE. The only problem arises when the number of points is not divisible by 4, but one can easily pad the arrays, or the compiler can generate scalar code for the remaining iterations. Either way, with many points it is much more beneficial to lose a little performance on the last 1 to 3 points than to constantly underutilise the hardware at every point.
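Putting the SoA version together, a minimal sketch of the whole transform might look like this (the function name and parameter layout are illustrative; `restrict` promises the compiler that the output arrays do not alias the inputs, which is often needed for auto-vectorisation to kick in):

```c
#include <stddef.h>

/* Illustrative SoA transform: plain countable loops over contiguous
 * arrays, exactly the shape auto-vectorisers handle well. 'restrict'
 * guarantees the output arrays do not alias the inputs. The compiler
 * takes care of the n % 4 (or n % 8) leftover points with a scalar
 * epilogue, so n need not be a multiple of the vector width. */
static void transform_soa(const float M[3][3],
                          const float *restrict x,
                          const float *restrict y,
                          const float *restrict z,
                          float *restrict rx,
                          float *restrict ry,
                          float *restrict rz,
                          size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        rx[i] = M[0][0]*x[i] + M[0][1]*y[i] + M[0][2]*z[i];
        ry[i] = M[1][0]*x[i] + M[1][1]*y[i] + M[1][2]*z[i];
        rz[i] = M[2][0]*x[i] + M[2][1]*y[i] + M[2][2]*z[i];
    }
}
```

With GCC or Clang at -O2/-O3 (plus -ftree-vectorize or -mavx as appropriate), a loop of this shape is routinely turned into packed SSE/AVX code.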
On the other hand, if you frequently need to access the (x,y,z) tuple of random points, the SoA implementation results in three cache-line reads (if the data does not fit in the cache), while the AoS implementation requires only one or two (with padding, the cases where two loads are necessary can be avoided). So the answer is: the right data structure depends on what kind of operations dominate the algorithm.