I've got a little bit of code in my innermost loop that I'm using to clamp some error values for a rasterization algorithm I'm writing:
float cerror[4] = {
    MINF(error[0], 1.0f),
    MINF(error[1], 1.0f),
    MINF(error[2], 1.0f),
    MINF(error[3], 1.0f)
};
where MINF is just MINF(a,b) = ((a) < (b)) ? (a) : (b)
I have 4 error values to update in this inner loop, all floats, so it'd be great if I could keep them all in SSE registers and compute the minimum with a single minps rather than four separate scalar operations. The compiler doesn't seem to be doing that for me, though.
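For reference, the operation being asked for corresponds to roughly the following hand-written version. This is only a sketch of the target, not the portable solution the question is after: it assumes an x86 compiler that provides `<xmmintrin.h>`, the helper name `clamp4` is hypothetical, and a scalar fallback is included for builds without SSE.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Hypothetical helper: clamp four floats to an upper bound of 1.0f. */
static void clamp4(const float *in, float *out)
{
#if defined(__SSE__)
    __m128 v   = _mm_loadu_ps(in);          /* unaligned load of 4 floats */
    __m128 one = _mm_set1_ps(1.0f);         /* broadcast the clamp value */
    _mm_storeu_ps(out, _mm_min_ps(v, one)); /* one minps covers all 4 lanes */
#else
    for (size_t i = 0; i < 4; i++)          /* scalar fallback without SSE */
        out[i] = (in[i] < 1.0f) ? in[i] : 1.0f;
#endif
}
```

The unaligned load and store are used so the sketch does not depend on how the arrays are aligned; with 16-byte aligned data, `_mm_load_ps` and `_mm_store_ps` could be used instead.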
I even tried moving it to its own function so I can see the vectorizer output:
void fclamp4(float* __restrict__ aa, float* __restrict__ bb) {
    for (size_t ii=0; ii < 4; ii++) {
        aa[ii] = (bb[ii] > 1.0) ? 1.0f : bb[ii];
    }
}
Which gives me something like:
inc/simplex.h:1508: note: not vectorized: unsupported data-type bool
inc/simplex.h:1507: note: vectorized 0 loops in function.
Is there a way to better encourage the compiler to do this for me? I'd rather not skip straight to intrinsics if I can avoid it, so the code remains portable. Is there perhaps a general reference with common patterns?
Lastly, all of my error/cerror/error increments are stored in float[4] arrays on the stack. Do I need to manually align those, or can the compiler handle that for me?
Edit: playing around with an aligned type and still no dice.
#include <stdio.h>
#include <stdlib.h>
typedef float __attribute__((aligned (16))) float4[4];
inline void doit(const float4 a, const float4 b, float4 c) {
    for (size_t ii=0; ii < 4; ii++) {
        c[ii] = (a[ii] < b[ii]) ? a[ii] : b[ii];
    }
}
int main() {
    float4 a = {rand(), rand(), rand(), rand() };
    float4 b = {1.0f, 1.0f, 1.0f, 1.0f };
    float4 c;
    doit((float*)&a, (float*)&b, (float*)&c);
    printf("%f\n", c[0]);
}
The vectorizer says:
ssetest.c:7: note: vect_model_load_cost: aligned.
ssetest.c:7: note: vect_model_load_cost: inside_cost = 4, outside_cost = 0 .
ssetest.c:7: note: vect_model_load_cost: aligned.
ssetest.c:7: note: vect_model_load_cost: inside_cost = 4, outside_cost = 0 .
ssetest.c:7: note: not vectorized: relevant stmt not supported: D.3177D.3177_22 = iftmp.4_18 < iftmp.4_21;
ssetest.c:12: note: vectorized 0 loops in function.
Edit again: I should note I've been trying this on GCC 4.4.7 (RHEL 6) and GCC 4.6 (Ubuntu), both without luck.
It looks like GCC doesn't enable vectorization of reductions unless you specify -ffast-math or -fassociative-math. When I enable those, it vectorizes just fine (using fminf in the inner loop):
ssetest.c:9: note: vect_model_load_cost: aligned.
ssetest.c:9: note: vect_model_load_cost: inside_cost = 1, outside_cost = 0 .
ssetest.c:9: note: vect_model_load_cost: aligned.
ssetest.c:9: note: vect_model_load_cost: inside_cost = 1, outside_cost = 0 .
ssetest.c:9: note: vect_model_simple_cost: inside_cost = 1, outside_cost = 0 .
ssetest.c:9: note: vect_model_store_cost: inside_cost = 1, outside_cost = 0 .
ssetest.c:9: note: Cost model analysis:
Vector inside of loop cost: 4
Vector outside of loop cost: 0
Scalar iteration cost: 4
Scalar outside cost: 0
prologue iterations: 0
epilogue iterations: 0
Calculated minimum iters for profitability: 1
ssetest.c:9: note: Profitability threshold = 3
ssetest.c:9: note: LOOP VECTORIZED.
ssetest.c:15: note: vectorized 1 loops in function.