I am compiling my Fortran code with gfortran and -mavx, and I have verified via objdump that some instructions are being vectorized, but I'm not getting the speed improvements I was expecting, so I want to make sure the following operation is being vectorized (this single line is ~50% of the runtime).
I know that some instructions can be vectorized, while others cannot, so I want to make sure this can be:
sum(A(i1:i2,ir))
Again, this single line takes about 50% of the runtime since I am doing it over a very large matrix. I can give more information on why I am doing this, but suffice it to say that it is necessary, though I can restructure the memory if needed (for example, I could do the sum as sum(A(ir,i1:i2)) if that could be vectorized instead).
Is this line being vectorized? How can I tell? How do I force vectorization if it is not being vectorized?
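For context, a reduced example of the access pattern I mean (the array sizes, the surrounding loop, and the program name here are made up for illustration; my real code differs) would be something like:
program colsum
  implicit none
  integer, parameter :: n = 4096, nr = 512
  double precision, allocatable :: A(:,:)
  double precision :: tsum
  integer :: ir, i1, i2

  allocate(A(n, nr))
  call random_number(A)   ! stand-in for the real data
  i1 = 1
  i2 = n

  tsum = 0.0d0
  do ir = 1, nr
     ! the line whose vectorization is in question
     tsum = tsum + sum(A(i1:i2, ir))
  enddo
  print *, tsum           ! keep the compiler from discarding the work
end program colsum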
EDIT: Thanks to the comments, I now realize that I can check the vectorization of this summation via -ftree-vectorizer-verbose and see that it is not vectorizing. I have restructured the code as follows:
tsum = 0.0d0
tn = i2 - i1 + 1
tvec(1:tn) = A(i1:i2, ir)
do ii = 1, tn
   tsum = tsum + tvec(ii)
enddo
and this ONLY vectorizes when I turn on -funsafe-math-optimizations, but I do see another 70% speed increase due to vectorization. The question still holds: why does sum(A(i1:i2,ir)) not vectorize, and how can I get a simple sum to vectorize?
It turns out that I am not able to make use of the vectorization unless I include -ffast-math or -funsafe-math-optimizations.
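Presumably the reason is that a vectorized sum accumulates partial sums in a different order, and floating-point addition is not associative, so gfortran will not reorder the reduction unless one of those flags tells it the reordering is acceptable. A tiny illustration of the non-associativity (my own toy example, separate from the timed code):
program reassoc
  implicit none
  double precision :: x(3), left, right
  ! toy values chosen so that the order of the additions matters
  x = (/ 1.0d0, 1.0d16, -1.0d16 /)
  left  = (x(1) + x(2)) + x(3)   ! 1.0 is absorbed by 1.0d16, so the result is 0.0
  right = x(1) + (x(2) + x(3))   ! the large terms cancel first, so the result is 1.0
  print *, left, right
end program reassoc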
The two code snippets I played with are:
tsum = 0.0d0
tvec(1:n) = A(i1:i2, ir)
do ii = 1, n
   tsum = tsum + tvec(ii)
enddo
and
tsum = sum(A(i1:i2,ir))
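A minimal, self-contained driver in this spirit (the matrix size, repeat count, and timing harness are my own placeholders, not the code I actually benchmarked) might look like:
program sum_bench
  implicit none
  integer, parameter :: n = 4096, nr = 2048
  double precision, allocatable :: A(:,:), tvec(:)
  double precision :: tsum
  integer :: ir, ii, i1, i2
  integer(8) :: c0, c1, rate

  allocate(A(n, nr), tvec(n))
  call random_number(A)
  i1 = 1
  i2 = n

  ! variant 1: copy the column slice into a temporary, then sum it in a loop
  tsum = 0.0d0
  call system_clock(c0, rate)
  do ir = 1, nr
     tvec(1:n) = A(i1:i2, ir)
     do ii = 1, n
        tsum = tsum + tvec(ii)
     enddo
  enddo
  call system_clock(c1)
  print *, 'loop sum      :', tsum, dble(c1 - c0) / dble(rate), 'sec'

  ! variant 2: the sum intrinsic on the same slice
  tsum = 0.0d0
  call system_clock(c0)
  do ir = 1, nr
     tsum = tsum + sum(A(i1:i2, ir))
  enddo
  call system_clock(c1)
  print *, 'intrinsic sum :', tsum, dble(c1 - c0) / dble(rate), 'sec'
end program sum_bench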
Here are the times I get when running the first code snippet with different compilation options:
10.62 sec ... None
10.35 sec ... -mtune=native -mavx
7.44 sec ... -mtune=native -mavx -ffast-math
7.49 sec ... -mtune=native -mavx -funsafe-math-optimizations
Finally, with these same optimizations, I am able to vectorize tsum = sum(A(i1:i2,ir)) to get:
7.96 sec ... None
8.41 sec ... -mtune=native -mavx
5.06 sec ... -mtune=native -mavx -ffast-math
4.97 sec ... -mtune=native -mavx -funsafe-math-optimizations
Comparing the sum version compiled with -mtune=native -mavx against the version compiled with -mtune=native -mavx -funsafe-math-optimizations shows a ~70% speedup. (Note that these timings are from single runs; before we publish we will do true benchmarking over multiple runs.)
I do take a small hit, though: my values change slightly when I use the -f options. Without them, the errors for my variables (v1, v2) are:
v1 ... 5.60663e-15 9.71445e-17 1.05471e-15
v2 ... 5.11674e-14 1.79301e-14 2.58127e-15
but with the optimizations, the errors are:
v1 ... 7.11931e-15 5.39846e-15 3.33067e-16
v2 ... 1.97273e-13 6.98608e-14 2.17742e-14
which indicates that something genuinely different is going on (presumably the changed order of accumulation in the vectorized sum).