I implemented the following code which converts 32 bytes of input to uppercase:
Version 1:
void to_upper(char* input) {
for (int i = 0; i < 32; ++i) {
input[i] = (input[i] >= 'a' && input[i] <= 'z') ? input[i] - 32 : input[i];
}
}
Version 2:
void to_upper(char* input) {
for (int i = 0; i < 32; ++i) {
if (input[i] >= 'a' && input[i] <= 'z') {
input[i] = input[i] - 32; // same for: input[i] -= 32;
}
}
}
The first version gets autovectorized, the second doesn't. The behaviour is consistent across clang and gcc. Moreover, I implemented both versions in Rust, too, and the Rust compiler does the same, Version 1 autovectorized, Version 2 isn't.
What is limiting the compiler from vectorizing the second version?
Vectorization means that the compiler detects that your independent instructions can be executed as one SIMD instruction. Usual example is that if you do something like for(i=0; i<N; i++){ a[i] = a[i] + b[i]; }
Performing the exact same operation on every entry in a structure (but with different values) is the essence of “vectorized code.” When we wrote a for -loop in xlin_fits_R () to perform the same steps for each index, we were in fact fighting the language.
Vectorized code has many advantages, among which are: the code more closely resembles standard mathematical notation (making it easier, typically, to correctly code mathematical constructs) vectorization results in more “Pythonic” code. Without vectorization, our code would be littered with inefficient and difficult to read for loops.
The syntax in the vectorization approach when the complier see that, it will translate it into optimized CPU instructions that multiplies vectors. Like SIMD. @mskw: That's pseudo-code, not actual syntax for a C vector extension.
Vectorization is the term for converting a scalar program to a vector program. Vectorized programs can run multiple operations from a single instruction, whereas scalar can only operate on pairs of operands at once. isn't that in essence same as Scalar approach?
Basically optimization passes have harder time recognizing the conditional assignment in second loop.
In first case GCC generates a slightly different intermediate representation which allows if-conversion pass to convert code to vectorizable form:
<bb 3>:
# i_18 = PHI <i_14(4), 0(2)>
# ivtmp_24 = PHI <ivtmp_21(4), 32(2)>
_6 = (sizetype) i_18;
_7 = input_5(D) + _6;
_8 = *_7;
_9 = (unsigned char) _8;
_10 = _9 + 159;
_11 = _9 + 224;
iftmp.0_12 = (char) _11;
iftmp.0_2 = _10 <= 25 ? iftmp.0_12 : _8;
*_7 = iftmp.0_2;
i_14 = i_18 + 1;
ivtmp_21 = ivtmp_24 - 1;
if (ivtmp_21 != 0)
goto <bb 4>;
else
goto <bb 5>;
whereas in second case code contains spurious jumps which complicate analysis and break vectorization:
<bb 3>:
# i_15 = PHI <i_14(6), 0(2)>
# ivtmp_26 = PHI <ivtmp_25(6), 32(2)>
_5 = (sizetype) i_15;
_7 = input_6(D) + _5;
_8 = *_7;
_9 = (unsigned char) _8;
_10 = _9 + 159;
if (_10 <= 25)
goto <bb 4>;
else
goto <bb 5>;
<bb 4>:
_11 = _9 + 224;
_12 = (char) _11;
*_7 = _12;
<bb 5>:
i_14 = i_15 + 1;
ivtmp_25 = ivtmp_26 - 1;
if (ivtmp_25 != 0)
goto <bb 6>;
else
goto <bb 7>;
Many optimization passes work as pattern matchers which recognize and optimize common cases so I wouldn't be surprised with this behavior.
You can try filing a bug in GCC tracker.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With