When compiling with gcc -O3
, why does the following loop not vectorize (automatically):
#define SIZE (65536) int a[SIZE], b[SIZE], c[SIZE]; int foo () { int i, j; for (i=0; i<SIZE; i++){ for (j=i; j<SIZE; j++) { a[i] = b[i] > c[j] ? b[i] : c[j]; } } return a[0]; }
when the following one does?
#define SIZE (65536) int a[SIZE], b[SIZE], c[SIZE]; int foov () { int i, j; for (i=0; i<SIZE; i++){ for (j=i; j<SIZE; j++) { a[i] += b[i] > c[j] ? b[i] : c[j]; } } return a[0]; }
The only difference is whether the result of the expression in the inner loop is assigned to a[i], or added to a[i].
For reference -ftree-vectorizer-verbose=6
gives the following output for the first (non-vectorizing) loop.
v.c:8: note: not vectorized: inner-loop count not invariant. v.c:9: note: Unknown alignment for access: c v.c:9: note: Alignment of access forced using peeling. v.c:9: note: not vectorized: live stmt not supported: D.2700_5 = c[j_20]; v.c:5: note: vectorized 0 loops in function.
And the same output for the loop that vectorizes is:
v.c:8: note: not vectorized: inner-loop count not invariant. v.c:9: note: Unknown alignment for access: c v.c:9: note: Alignment of access forced using peeling. v.c:9: note: vect_model_load_cost: aligned. v.c:9: note: vect_model_load_cost: inside_cost = 1, outside_cost = 0 . v.c:9: note: vect_model_simple_cost: inside_cost = 1, outside_cost = 1 . v.c:9: note: vect_model_reduction_cost: inside_cost = 1, outside_cost = 6 . v.c:9: note: cost model: prologue peel iters set to vf/2. v.c:9: note: cost model: epilogue peel iters set to vf/2 because peeling for alignment is unknown . v.c:9: note: Cost model analysis: Vector inside of loop cost: 3 Vector outside of loop cost: 27 Scalar iteration cost: 3 Scalar outside cost: 7 prologue iterations: 2 epilogue iterations: 2 Calculated minimum iters for profitability: 8 v.c:9: note: Profitability threshold = 7 v.c:9: note: Profitability threshold is 7 loop iterations. v.c:9: note: LOOP VECTORIZED. v.c:5: note: vectorized 1 loops in function.
In the first case: the code overwrites the same memory location a[i]
in each iteration. This inherently sequentializes the loop as the loop iterations are not independent.
(In reality, only the final iteration is actually needed. So the entire inner loop could be taken out.)
In the second case: GCC recognizes the loop as a reduction operation - for which it has special case handling to vectorize.
Compiler vectorization is often implemented as some sort of "pattern matching". Meaning the compiler analyzes code to see if it fits a certain pattern that it's able to vectorize. If it does, it gets vectorized. If it doesn't, then it doesn't.
This seems to be a corner case where the first loop doesn't fit any of the pre-coded patterns that GCC can handle. But the second case fits the "vectorizable reduction" pattern.
Here's the relevant part of GCC's source code that spits out that "not vectorized: live stmt not supported: "
message:
http://svn.open64.net/svnroot/open64/trunk/osprey-gcc-4.2.0/gcc/tree-vect-analyze.c
if (STMT_VINFO_LIVE_P (stmt_info)) { ok = vectorizable_reduction (stmt, NULL, NULL); if (ok) need_to_vectorize = true; else ok = vectorizable_live_operation (stmt, NULL, NULL); if (!ok) { if (vect_print_dump_info (REPORT_UNVECTORIZED_LOOPS)) { fprintf (vect_dump, "not vectorized: live stmt not supported: "); print_generic_expr (vect_dump, stmt, TDF_SLIM); } return false; } }
From just the line:
vectorizable_reduction (stmt, NULL, NULL);
It's clear that GCC is checking to see if it matches a "vectorizable reduction" pattern.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With