
Puzzling GCC behaviour with respect to vectorization and loop size

While initially investigating the effect of the #pragma omp simd directive, I came across a behaviour that I cannot explain, related to the vectorization of a simple for loop. The following code sample can be tested on this awesome compiler explorer, provided the -O3 flag is passed and we're on the x86 architecture.

Could anybody explain the logic behind the following observations?

#include <stdint.h> 

void test(uint8_t* out, uint8_t const* in, uint32_t length)
{
    unsigned const l1 = (length * 32)/32;  // This is vectorized
    unsigned const l2 = (length / 32)*32;  // This is not vectorized

    unsigned const l3 = (length << 5)>>5;  // This is vectorized
    unsigned const l4 = (length >> 5)<<5;  // This is not vectorized

    unsigned const l5 = length - length%32; // This is not vectorized
    unsigned const l6 = length & ~(32 - 1); // This is not vectorized

    for (unsigned i = 0; i<l1 /*pick your choice*/; ++i)
    {
      out[i] = in[i*2];
    }
}

What puzzles me is that both l1 and l3 generate vectorized code despite not being guaranteed to be multiples of 32, while all the other lengths, which are guaranteed to be multiples of 32, do not produce vectorized code (for instance, with length = 33, l1 = (33*32)/32 = 33, whereas l2 = (33/32)*32 = 32). Is there a reason behind this?

As an aside, using the #pragma omp simd directive doesn't actually change anything.
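For reference, this is roughly how the pragma was applied (a sketch with a hypothetical name, compiled with -O3 and -fopenmp-simd; l1 stands for whichever bound is picked above). The generated code was the same with or without it:

#include <stdint.h>

void test_simd(uint8_t* out, uint8_t const* in, uint32_t length)
{
    unsigned const l1 = (length * 32)/32;

    /* Hinting that the loop is safe to vectorize made no
       difference in the code GCC generated. */
    #pragma omp simd
    for (unsigned i = 0; i < l1; ++i)
    {
        out[i] = in[i*2];
    }
}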

Edit: After further investigation, the difference in behaviour disappears when the index type is size_t (and no bound manipulation is even needed), meaning that this generates vectorized code:

#include <stdint.h> 
#include <stddef.h>

void test(uint8_t* out, uint8_t const* in, size_t length)
{
    for (size_t i = 0; i<length; ++i)
    {
        out[i] = in[i*2];
    }
}

If somebody knows why loop vectorization is so dependent on the index type, I'd be curious to know more!

Edit 2: Thanks to Mark Lakata, -O3 is actually needed.

asked Jul 15 '16 by Benjamin Lefaudeux


1 Answer

The issue is the apparent conversion1 from unsigned to size_t in the array index: in[i*2].

If you use l1 or l3, the bound can be at most UINT_MAX/32, so the computation i*2 always fits in the type unsigned without wrapping. The index therefore behaves exactly as if it had the type size_t.

But when you use the other options, i can get large enough that i*2 wraps around in unsigned before the conversion to size_t is made, and the compiler must account for that possible wraparound.
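To see the wraparound concretely, here is a small standalone sketch (my own illustration, not code from the answer; it assumes a 64-bit size_t): with a 32-bit unsigned index near UINT_MAX, doubling it wraps before the conversion to size_t, whereas widening first preserves the value.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    unsigned i = 3000000000u;           /* an index only reachable with the "other" bounds */

    size_t wrapped = (size_t)(i * 2);   /* multiplication wraps in 32 bits, then widens */
    size_t widened = (size_t)i * 2;     /* widens to size_t first, so no wraparound */

    printf("%zu\n%zu\n", wrapped, widened);  /* prints 1705032704 and 6000000000 */
    return 0;
}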

If you take your first example, pick one of the other bounds, and cast the index before the multiplication:

out[i] = in[(size_t)i*2];

the compiler vectorizes the loop. If you instead cast the whole expression:

out[i] = in[(size_t)(i*2)];

it doesn't, because the multiplication still happens (and can still wrap) in unsigned.
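Putting that suggestion back into the original function gives roughly the following (a sketch based on the answer above, with a hypothetical name and l5 chosen as the bound):

#include <stdint.h>
#include <stddef.h>

void test_cast(uint8_t* out, uint8_t const* in, uint32_t length)
{
    unsigned const l5 = length - length%32;  /* one of the bounds that did not vectorize */

    for (unsigned i = 0; i < l5; ++i)
    {
        /* Casting i before the multiplication widens it first,
           so the index expression cannot wrap in 32 bits. */
        out[i] = in[(size_t)i*2];
    }
}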


1 The Standard doesn't actually specify that the type of the index must be size_t, but it is a logical step from the compiler's perspective.

answered by 2501