When compiled with GCC 5.2 using -std=c99, -O3, and -mavx2, the following code sample auto-vectorizes (assembly here):
#include <stdint.h>

void test(uint32_t *restrict a,
          uint32_t *restrict b) {
    uint32_t *a_aligned = __builtin_assume_aligned(a, 32);
    uint32_t *b_aligned = __builtin_assume_aligned(b, 32);
    for (int i = 0; i < (1L << 10); i += 2) {
        a_aligned[i]   = 42 * b_aligned[i];
        a_aligned[i+1] = 3 * a_aligned[i+1];
    }
}
But the following code sample does not auto-vectorize (assembly here):
#include <stdint.h>

void test(uint32_t *restrict a,
          uint32_t *restrict b) {
    uint32_t *a_aligned = __builtin_assume_aligned(a, 32);
    uint32_t *b_aligned = __builtin_assume_aligned(b, 32);
    for (int i = 0; i < (1L << 10); i += 2) {
        a_aligned[i]   = 42 * b_aligned[i];
        a_aligned[i+1] = a_aligned[i+1];
    }
}
The only difference between the samples is the scaling factor applied to a_aligned[i+1]. This was also the case for GCC 4.8, 4.9, and 5.1. Adding volatile to a_aligned's declaration inhibits auto-vectorization completely. The first sample consistently runs faster than the second for us, with a more pronounced speedup for smaller types (e.g. uint8_t instead of uint32_t).
Is there a way to make the second code sample auto-vectorize with GCC?
Loop vectorization transforms procedural loops so that several pairs of operands are processed by a single instruction. Programs spend most of their time inside such loops, so vectorizing them can yield a significant speedup, especially over large data sets.
Vectorization is the process of converting an algorithm from operating on a single value at a time to operating on a set of values at once. Modern CPUs provide direct support for this through single-instruction, multiple-data (SIMD) instructions.
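For example (the function name scale_add and the constant 3 are made up for illustration), a loop like the following is a textbook candidate: every iteration is independent, so when built with -O3 and a suitable -m option the compiler can process several elements per instruction.

#include <stddef.h>
#include <stdint.h>

/* Each iteration is independent, so at -O3 (e.g. with -mavx2) GCC can
 * process several elements of a[] and b[] per SIMD instruction.       */
void scale_add(uint32_t *restrict a, const uint32_t *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = a[i] + 3 * b[i];
}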
The goal of SLP vectorization (a.k.a. superword-level parallelism) is to combine similar independent instructions into vector instructions. Memory accesses, arithmetic operations, comparison operations, and PHI nodes can all be vectorized with this technique.
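As an illustration (this snippet is not from the question), the four independent statements below touch adjacent memory locations, so an SLP pass can merge them into a single vector add even though there is no loop:

#include <stdint.h>

/* Four similar, independent scalar adds on adjacent elements:
 * an SLP pass can combine them into one 128-bit vector add.   */
void slp_example(uint32_t *restrict a, const uint32_t *restrict b,
                 const uint32_t *restrict c)
{
    a[0] = b[0] + c[0];
    a[1] = b[1] + c[1];
    a[2] = b[2] + c[2];
    a[3] = b[3] + c[3];
}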
GCC's loop vectorizer was developed on the autovect branch, based on the tree-ssa framework; vectorization of loops that operate on multiple data types, including type conversions, was submitted for incorporation into GCC 4.2.
There are many conditions that must hold before GCC will auto-vectorize a loop: the compiler needs confirmation that the data is suitably aligned, the code will most likely have to be rewritten to simplify the loop body, and even then auto-vectorization isn't guaranteed. To demonstrate a successful auto-vectorization, we will create a simple C program that fills two arrays with random numbers in the range -1000 to +1000, sums both arrays element-by-element into a third array, then sums the third array and displays the result (a sketch follows below). Confirming a successful auto-vectorization can be a little tricky.
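A minimal version of that demonstration program might look like the following (the array names and size are arbitrary); one way to confirm that the element-wise loop was vectorized is to compile with, say, gcc -O3 -mavx2 -fopt-info-vec and look for the "loop vectorized" notes.

#include <stdio.h>
#include <stdlib.h>

#define N 1024

int a[N], b[N], sum[N];

int main(void)
{
    /* Fill both arrays with random numbers in -1000..+1000. */
    for (int i = 0; i < N; i++) {
        a[i] = rand() % 2001 - 1000;
        b[i] = rand() % 2001 - 1000;
    }

    /* Sum both arrays element-by-element into a third array:
     * a good auto-vectorization candidate.                    */
    for (int i = 0; i < N; i++)
        sum[i] = a[i] + b[i];

    /* Sum the third array and display the result. */
    long total = 0;
    for (int i = 0; i < N; i++)
        total += sum[i];

    printf("total = %ld\n", total);
    return 0;
}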
gcc has another extension that helps with vectorization: vector types. It is possible to construct types that represent arrays (i.e. vectors) of smaller, more basic types; code can then use normal C (or C++) operations on those types.
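As a rough sketch of that extension (the type name v8u32 and the function test_vec are placeholders, not part of the question), a 32-byte vector of eight uint32_t lanes can be declared with __attribute__((vector_size(32))) and manipulated with ordinary operators. The alternating constant vectors below reproduce the even/odd pattern of the question's first sample, assuming the arrays can be treated as arrays of such vectors:

#include <stdint.h>

/* Eight uint32_t lanes packed into one 32-byte vector. */
typedef uint32_t v8u32 __attribute__((vector_size(32)));

void test_vec(v8u32 *restrict a, const v8u32 *restrict b)
{
    /* Even lanes become 42*b, odd lanes become 3*a, lane-wise. */
    const v8u32 kb = {42, 0, 42, 0, 42, 0, 42, 0};
    const v8u32 ka = { 0, 3,  0, 3,  0, 3,  0, 3};
    for (int i = 0; i < (1 << 10) / 8; i++)
        a[i] = kb * b[i] + ka * a[i];
}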
The following version vectorises, but that's ugly if you ask me...
#include <stdint.h>

void test(uint32_t *a, uint32_t *aa,
          uint32_t *restrict b) {
    #pragma omp simd aligned(a,aa,b:32)
    for (int i = 0; i < (1L << 10); i += 2) {
        a[i]   = 2 * b[i];
        a[i+1] = aa[i+1];
    }
}
Compile with -fopenmp and call it as test(a, a, b).
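For completeness, a call site might look like the sketch below (the static, 32-byte-aligned buffers are an assumption made for illustration); the whole thing would then be built with something like gcc -std=c99 -O3 -mavx2 -fopenmp.

#include <stdint.h>

void test(uint32_t *a, uint32_t *aa, uint32_t *restrict b);

/* 32-byte aligned buffers sized to match the loop bound (1 << 10);
 * static storage is used here only to keep the sketch short.       */
static uint32_t a[1 << 10] __attribute__((aligned(32)));
static uint32_t b[1 << 10] __attribute__((aligned(32)));

int main(void)
{
    test(a, a, b);   /* pass a twice, as suggested above */
    return 0;
}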