I was trying to vectorize a loop that calls the 'pow' function from the math library. I am aware that the Intel compiler supports 'pow' with SSE instructions, but I can't seem to get it to work with gcc (I think). This is the case I am working with:
#include <math.h>

int main(){
int i=0;
float a[256],
b[256];
float x= 2.3;
for (i =0 ; i<256; i++){
a[i]=1.5;
}
for (i=0; i<256; i++){
b[i]=pow(a[i],x);
}
for (i=0; i<256; i++){
b[i]=a[i]*a[i];
}
return 0;
}
I'm compiling with the following:
gcc -O3 -Wall -ftree-vectorize -msse2 -ftree-vectorizer-verbose=5 code.c -o runthis
This is on OS X 10.5.8 using gcc version 4.2 (I tried 4.5 as well and couldn't tell whether it had vectorized anything, as it didn't output anything at all). It appears that none of the loops vectorize. Is there an alignment issue, or some other issue that requires me to use restrict? If I write one of the loops as a function I get slightly more verbose output (code):
void pow2(float *a, float * b, int n) {
int i;
for (i=0; i<n; i++){
b[i]=a[i]*a[i];
}
}
output (using level 7 verbose output):
note: not vectorized: can't determine dependence between *D.2878_13 and *D.2877_8
bad data dependence.
I looked at the gcc auto-vectorization page but that didn't help too much. If it is not possible to use pow with gcc, where could I find a resource for a pow-equivalent function (I'm mostly dealing with integer powers)?
Edit: I was just digging into some other source. How did it vectorize this?!
void array_op(double * d,int len,double value,void (*f)(double*,double*) ) {
for ( int i = 0; i < len; i++ ){
f(&d[i],&value);
}
}
The relevant gcc output:
note: Profitability threshold is 3 loop iterations.
note: LOOP VECTORIZED.
Well, now I'm at a loss: 'd' and 'value' are modified by a function that gcc doesn't know about. Strange? Maybe I need to test this portion a little more thoroughly to make sure the results are correct for the vectorized portion. Still looking for a vectorized math library; why aren't there any open source ones?
Using __restrict or consuming inputs (assigning to local vars) before writing outputs should help.
As it is now, the compiler cannot vectorize because a might alias b, so doing 4 multiplies in parallel and writing back 4 values might not be correct.
(Note that __restrict won't guarantee that the compiler vectorizes, but as the code stands right now, it certainly cannot.)
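A rough sketch of both suggestions, applied to the pow2 function from the question. The names pow2_restrict and pow2_locals are made up for illustration, and whether gcc actually vectorizes either version still depends on the compiler version and flags:

```c
/* Hypothetical variant 1: __restrict (a GCC extension keyword) promises the
   compiler that a and b never overlap, which removes the dependence it
   complained about. Calling it with overlapping arrays is undefined. */
void pow2_restrict(const float *__restrict a, float *__restrict b, int n) {
    int i;
    for (i = 0; i < n; i++) {
        b[i] = a[i] * a[i];
    }
}

/* Hypothetical variant 2: consume four inputs into locals before writing
   any output, so no store can sit between the loads of one group. */
void pow2_locals(const float *a, float *b, int n) {
    int i;
    for (i = 0; i + 4 <= n; i += 4) {
        float a0 = a[i], a1 = a[i + 1], a2 = a[i + 2], a3 = a[i + 3];
        b[i]     = a0 * a0;
        b[i + 1] = a1 * a1;
        b[i + 2] = a2 * a2;
        b[i + 3] = a3 * a3;
    }
    /* Scalar tail for the last n % 4 elements. */
    for (; i < n; i++) {
        b[i] = a[i] * a[i];
    }
}
```

Both assume the caller never passes overlapping arrays; if a and b do overlap, the results can differ from the original loop.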
This is not really an answer to your question, but rather a suggestion for how you might be able to avoid this issue entirely.
You mention that you're on OS X; there are already APIs on that platform that provide the operations you're looking at, without any need for auto-vectorization. Is there some reason that you aren't using them instead? Auto-vectorization is really cool, but it requires some work, and in general it doesn't produce results that are as good as using APIs that are already vectorized for you.
#include <string.h>
#include <Accelerate/Accelerate.h>
int main() {
int n = 256;
float a[256],
b[256];
// You can initialize the elements of a vector to a set value using memset_pattern:
float threehalves = 1.5f;
memset_pattern4(a, &threehalves, 4*n);
// Since you have a fixed exponent for all of the base values, we will use
// the vImage gamma functions. If you wanted to have different exponents
// for each input (i.e. from an array of exponents), you would use the vForce
// vvpowf( ) function instead (also part of Accelerate).
//
// If you don't need full accuracy, replace kvImageGamma_UseGammaValue with
// kvImageGamma_UseGammaValue_half_precision to get better performance.
GammaFunction func = vImageCreateGammaFunction(2.3f, kvImageGamma_UseGammaValue, 0);
vImage_Buffer src = { .data = a, .height = 1, .width = n, .rowBytes = 4*n };
vImage_Buffer dst = { .data = b, .height = 1, .width = n, .rowBytes = 4*n };
vImageGamma_PlanarF(&src, &dst, func, 0);
vImageDestroyGammaFunction(func);
// To simply square a instead, use the vDSP_vsq function.
vDSP_vsq(a, 1, b, 1, n);
return 0;
}
More generally, unless your algorithm is quite simple, auto-vectorization is unlikely to deliver great results. In my experience, the spectrum of vectorization techniques usually looks about like this:
better performance                                            worse performance
more effort                                                         less effort
+------+------+----------------------+----------------------------+-----------+
|      |      |                      |                            |           |
|      |      use vectorized APIs    |                     auto vectorization |
|      skilled vector C              |                                   scalar code
hand written assembly                unskilled vector C