While initially investigating the effect of the #pragma omp simd directive, I came across a behaviour that I cannot explain, related to the vectorization of a simple for loop. The following code sample can be tested on this awesome compiler explorer, provided the -O3 flag is applied and we are targeting the x86 architecture.
Could anybody explain to me the logic behind the following observations?
#include <stdint.h>

void test(uint8_t* out, uint8_t const* in, uint32_t length)
{
    unsigned const l1 = (length * 32) / 32;   // This is vectorized
    unsigned const l2 = (length / 32) * 32;   // This is not vectorized
    unsigned const l3 = (length << 5) >> 5;   // This is vectorized
    unsigned const l4 = (length >> 5) << 5;   // This is not vectorized
    unsigned const l5 = length - length % 32; // This is not vectorized
    unsigned const l6 = length & ~(32 - 1);   // This is not vectorized

    for (unsigned i = 0; i < l1 /* pick your choice */; ++i)
    {
        out[i] = in[i * 2];
    }
}
What puzzles me is that both l1 and l3 generate vectorized code in spite of not being guaranteed to be multiples of 32, while all of the other lengths do not produce vectorized code even though they are guaranteed to be multiples of 32. Is there a reason behind this?
As an aside, using the #pragma omp simd directive doesn't actually change anything.
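For reference, this is where the directive would sit (a sketch of the placement I tried; with GCC or Clang the pragma is only honoured when compiling with -fopenmp or -fopenmp-simd):

#pragma omp simd
for (unsigned i = 0; i < l1; ++i)
{
    out[i] = in[i * 2];
}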
Edit: After further investigation, the difference in behaviour disappears when the index type is size_t (and no bound manipulation is even needed), meaning that the following generates vectorized code:
#include <stdint.h>
#include <stddef.h>

void test(uint8_t* out, uint8_t const* in, size_t length)
{
    for (size_t i = 0; i < length; ++i)
    {
        out[i] = in[i * 2];
    }
}
If somebody knows why loop vectorization is so dependent on the index type, I'd be curious to know more!
Edit 2: Thanks to Mark Lakata, -O3 is actually needed.
The issue is the apparent conversion¹ from unsigned to size_t in the array index in[i*2].

If you use l1 or l3, the loop bound can never exceed 2^32/32 = 2^27, because the multiplication (or left shift) wraps modulo 2^32 before the division. The computation of i*2 therefore always fits into the type unsigned without wrapping, and unsigned practically behaves as if it were size_t.

But when you use the other options, the bound can be close to 2^32, so the result of the computation i*2 can overflow the type unsigned: the value might wrap around, and the compiler must preserve that wrap when making the conversion to size_t, so the index is no longer a simple linear function of i.
If you take your first example, pick one of the options other than l1 or l3, and cast only i:

out[i] = in[(size_t)i * 2];

the compiler optimizes; if you cast the whole expression:

out[i] = in[(size_t)(i * 2)];

it doesn't, because the multiplication still wraps in unsigned before the cast.
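Based on that observation, a minimal sketch of the workaround applied to the non-vectorizing l2 variant could look like this (the function name test_cast is mine; assuming the same -O3 flag and x86 target as above):

#include <stddef.h>
#include <stdint.h>

void test_cast(uint8_t* out, uint8_t const* in, uint32_t length)
{
    unsigned const l2 = (length / 32) * 32;
    for (unsigned i = 0; i < l2; ++i)
    {
        /* Casting i before the multiplication makes i*2 happen in
           size_t, so the index cannot wrap and stays linear in i. */
        out[i] = in[(size_t)i * 2];
    }
}

Casting i rather than the product is the key point: it matches the first form above, not the second.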
¹ The Standard doesn't actually specify that the type of the index must be size_t, but it is a logical step from the compiler's perspective.