
How to auto-vectorize strided writes with GCC?

When compiled with GCC 5.2 using -std=c99, -O3, and -mavx2, the following code sample auto-vectorizes (assembly here):

#include <stdint.h>

void test(uint32_t *restrict a,
          uint32_t *restrict b) {
  uint32_t *a_aligned = __builtin_assume_aligned(a, 32);
  uint32_t *b_aligned = __builtin_assume_aligned(b, 32);

  for (int i = 0; i < (1L << 10); i += 2) {
    a_aligned[i] = 42 * b_aligned[i];
    a_aligned[i+1] = 3 * a_aligned[i+1];
  }
}

But the following code sample does not auto-vectorize (assembly here):

#include <stdint.h>

void test(uint32_t *restrict a,
          uint32_t *restrict b) {
  uint32_t *a_aligned = __builtin_assume_aligned(a, 32);
  uint32_t *b_aligned = __builtin_assume_aligned(b, 32);

  for (int i = 0; i < (1L << 10); i += 2) {
    a_aligned[i] = 42 * b_aligned[i];
    a_aligned[i+1] = a_aligned[i+1];
  }
}

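The only difference between the samples is the scaling factor applied to a_aligned[i+1]: the first sample multiplies it by 3, while the second assigns the element to itself unchanged.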

This was also the case for GCC 4.8, 4.9, and 5.1. Adding volatile to a_aligned's declaration inhibits auto-vectorization completely. The first sample consistently runs faster than the second for us, with a more pronounced speedup for smaller types (e.g. uint8_t instead of uint32_t).
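
One possible reading of the difference, offered as an assumption rather than something confirmed by the assembly: because a_aligned is not volatile, the self-assignment in the second sample has no observable effect, so the optimizer is free to drop it. What would remain is a loop that stores only to every other element, which is the strided write named in the title. The sketch below only checks that the two loops are equivalent in effect; the names second_sample, strided_only, and samples_equivalent are hypothetical, and the alignment hints are left out for brevity.

```c
#include <stdint.h>
#include <string.h>

#define N (1 << 10)

/* Loop of the second sample, without the alignment hints. */
static void second_sample(uint32_t *restrict a, const uint32_t *restrict b) {
  for (int i = 0; i < N; i += 2) {
    a[i] = 42 * b[i];
    a[i + 1] = a[i + 1];  /* self-assignment: no observable effect */
  }
}

/* Same loop with the self-assignment removed: a stride-2 store with gaps. */
static void strided_only(uint32_t *restrict a, const uint32_t *restrict b) {
  for (int i = 0; i < N; i += 2) {
    a[i] = 42 * b[i];
  }
}

/* Hypothetical check: returns 1 when both loops leave identical arrays. */
int samples_equivalent(void) {
  static uint32_t a1[N], a2[N], b[N];
  for (int i = 0; i < N; i++) {
    a1[i] = a2[i] = 1000u + (uint32_t)i;
    b[i] = (uint32_t)i;
  }
  second_sample(a1, b);
  strided_only(a2, b);
  return memcmp(a1, a2, sizeof a1) == 0;
}
```

Calling samples_equivalent() from a main function should return 1, since the odd elements are never changed by either loop.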

Is there a way to make the second code sample auto-vectorize with GCC?

asked Oct 17 '15 23:10 by T. Wagner


1 Answer

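The following version vectorises, though it is ugly if you ask me. It reads the odd elements through a second pointer, aa, which is meant to point to the same array as a: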

#include <stdint.h>

void test(uint32_t *a, uint32_t *aa,
          uint32_t *restrict b) {
  #pragma omp simd aligned(a,aa,b:32)
  for (int i = 0; i < (1L << 10); i += 2) {
    a[i] = 2 * b[i];
    a[i+1] = aa[i+1];
  }
}

Compile with -fopenmp and call it as test(a, a, b).
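
As a rough sanity check of the call test(a, a, b), here is a minimal sketch that reproduces the function above and confirms that the odd elements come back unchanged. The helper check_test and the fill values are hypothetical, and the arrays are aligned with GCC's aligned attribute to satisfy the 32-byte promise made in the pragma. Note that this version multiplies by 2, where the question uses 42. Without -fopenmp GCC ignores the pragma, so the check passes either way; it says nothing about whether the loop was actually vectorized.

```c
#include <stdint.h>
#include <string.h>

#define N (1 << 10)

/* The function from the answer, reproduced as posted. */
void test(uint32_t *a, uint32_t *aa,
          uint32_t *restrict b) {
  #pragma omp simd aligned(a,aa,b:32)
  for (int i = 0; i < (1L << 10); i += 2) {
    a[i] = 2 * b[i];
    a[i+1] = aa[i+1];
  }
}

/* Hypothetical helper: returns 1 if test(a, a, b) writes 2 * b[i] to the
   even elements and leaves the odd elements as they were. */
int check_test(void) {
  static uint32_t a[N] __attribute__((aligned(32)));
  static uint32_t b[N] __attribute__((aligned(32)));
  static uint32_t before[N];

  for (int i = 0; i < N; i++) {
    a[i] = 7u * (uint32_t)i + 1u;  /* arbitrary fill values */
    b[i] = (uint32_t)i;
  }
  memcpy(before, a, sizeof a);  /* snapshot to compare odd elements against */

  test(a, a, b);  /* aa aliases a, as the answer suggests */

  for (int i = 0; i < N; i += 2) {
    if (a[i] != 2u * b[i]) return 0;
    if (a[i + 1] != before[i + 1]) return 0;
  }
  return 1;
}
```

Calling check_test() from a main function should return 1 both with and without -fopenmp.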

answered Oct 18 '22 18:10 by Gilles