I am trying to make sense of how much stuff I can pack into vector hardware. Taking for example an Intel AVX-512 capable piece of hardware, I can fit either 8 doubles (64-bit) or 16 singles (32-bit) into my vector. However, if I am running on a 64-bit machine, then it is likely my default pointer size is 64-bit. Hence, if I want to dereference a pointer (or just access an array using the array syntax), then this would require a 64-bit integer operation. This seems to suggest to me that on a 64-bit machine the minimum partitioning I can have would be 64-bit data types.
Consider then the MWE I have below, where I would hope the compiler sees I am only handling 32-bit objects (or smaller). I would anticipate the reduction/calculation (supposing I was doing something more computationally intensive and less bandwidth-limited) to be done in half the time if the vector could be partitioned into 32-bit data types rather than 64-bit data types.
It seems to me that if I have vector registers and I want to do vector operations, then if I require n vector registers, where each register is split into data types of m bits, then any section of code I wish to vectorise cannot use data types larger than m bits. Is that right?
MWE

Compiled using icc 18.0.0 with -mkl -O2 -qopenmp -qopt-report, where the optimisation report verifies that the for loop vectorised.
#include <stdlib.h>
#include <stdio.h>

#define N 1024

int main(int argc, char **argv)
{
    unsigned int a[N];
    for (unsigned int i = 0; i < N; i++) a[i] = i;
    unsigned int z[N];
    unsigned int *b = a;

    printf("Sizes (Bytes)\n");
    printf("Pointer = %zu\n", sizeof(b));      // sizeof yields size_t, so %zu not %d
    printf("Unsigned int = %zu\n", sizeof(*b));
    printf("Array = %zu\n\n", sizeof(a));

    unsigned int sum = 0;
    #pragma omp simd reduction(+:sum)
    for (unsigned int i = 0; i < N; i++)
    {
        z[i] = 4 * a[i];
        unsigned int squares = a[i] * a[i]; // Possibly some more complex sequence of operations.
        sum += squares;
    }

    for (unsigned int i = 0; i < N; i += N/4) printf("z[%u] = %u\n", i, z[i]);
    printf("\nsum = %u\n", sum);
}
Output on my machine being:
Sizes (Bytes)
Pointer = 8
Unsigned int = 4
Array = 4096
z[0] = 0
z[256] = 1024
z[512] = 2048
z[768] = 3072
sum = 357389824
This seems to suggest to me that on a 64-bit machine the minimum partitioning I can have would be 64-bit data types.
This assumption is wrong.
To illustrate with an (awkward) analogy: the length of a postal address (in symbols) does not correlate with the size of the house it points to. Likewise, the width of a pointer does not correlate with the size of the data it references.
There is a lower bound on how small a piece of data can be addressed on a given type of hardware. It is called a byte (8 bits, a.k.a. an octet, on modern machines, though it has been 10 or 6 bits on some ancient generations). There is typically no upper bound, however. In Intel 64, as one example, the XSAVE family of instructions references a memory block that is nearly 4 kbyte long, using the same 32/64-bit pointers.
Taking for example an Intel AVX-512 capable piece of hardware I can fit either 8 doubles (64-bit) or 16 singles (32-bit) into my vector.
Or you can fit 32 half-floats (16-bit) or 64 bytes. I am not sure whether there are AVX-512 instructions operating on nibbles (4-bit chunks).
Is there a way to query the granularity of the vector partitioning that the compiler has used? (Avoiding digging through the resulting assembly).
Again, the lower bound for the compiler's choice is dictated by the width of the data types chosen in your program. If you use int, the granularity will be at least sizeof(int) bytes; if long, sizeof(long) bytes; and so on. It is unlikely that a type wider than necessary will be used, because it would cause semantic differences in the machine instructions that would have to be accounted for. For example, if a compiler, for some reason, chose to use a SIMD vector partitioned into uint64_t chunks to operate on a vector of uint32_t chunks, then it would have to hide the differences in overflow behavior, and that would incur a performance penalty.
I do not know of any OMP pragmas to query such information. It is unlikely that they exist, given that the same binary may contain multiple code paths chosen dynamically at runtime (at program startup, so-called dispatching, used at least by the Intel compiler), so compile-time querying is out of the question, and I cannot see much use for runtime querying.
On a 64-bit machine, how is the vector partitioned into data types less than 64-bit if a 64-bit memory address is assumed?
There are simply machine instructions that interpret the same SIMD registers differently. In Intel 64, as an example, there are all sorts of them; see a recent Intel Software Development Manual for examples.