Most of the BLAS Level 1 API can be written trivially using Fortran 9x+ array assignments and intrinsic procedures.
Assuming you are using a modern optimizing compiler, like Intel Fortran, and correct target-specific compiler optimization options, are there any performance benefits from using BLAS Level 1 procedures instead, say from Intel MKL or other fast BLAS implementations?
If there are, what is a typical vector size when these benefits appear?
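To make the premise concrete, here is a minimal sketch of how some double-precision Level 1 routines map onto array syntax and intrinsics. The program name, variable names, and sample values are illustrative assumptions, and `norm2` requires Fortran 2008 rather than Fortran 9x.

```fortran
program blas1_equivalents
  implicit none
  integer, parameter :: dp = kind(1.0d0)   ! double precision kind
  integer, parameter :: n = 5              ! illustrative vector length
  real(dp) :: x(n), y(n), z(n), a, d, nrm, asum
  integer :: i, imax

  ! Illustrative data (assumption, not from the question)
  x = [(real(i, dp), i = 1, n)]
  y = 1.0_dp
  a = 2.0_dp

  z = x                          ! DCOPY : z <- x
  y = a*x + y                    ! DAXPY : y <- a*x + y
  x = a*x                        ! DSCAL : x <- a*x
  d = dot_product(x, y)          ! DDOT  : sum of x(i)*y(i)
  nrm = norm2(x)                 ! DNRM2 : Euclidean norm (Fortran 2008)
  asum = sum(abs(x))             ! DASUM : sum of absolute values
  imax = maxloc(abs(x), dim=1)   ! IDAMAX: index of largest |x(i)|

  print *, d, nrm, asum, imax, z(n)
end program blas1_equivalents
```

The BLAS routines also accept stride arguments (`incx`, `incy`); in array syntax the counterpart would be an array section such as `x(1:n:2)`.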
It depends. We've tested this before with the Intel compiler and run into surprising results. For example, Fortran's DOT_PRODUCT intrinsic and the BLAS implementation showed different trends depending on the problem size. As the number of elements in the arrays got larger, BLAS became faster than the intrinsic. But for small problem sizes, the intrinsic was much faster.
We actually measured, for our use cases, the cut-off size at which one becomes better than the other, and we use if-statements to decide which to call. I can't share those results, but I encourage you to test it yourself. There is still a benefit to using BLAS.
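A rough sketch of that kind of size-based dispatch might look like the following. The module name `dot_dispatch`, the function `dot_auto`, and the cut-off value are all hypothetical; in particular `n_cutoff = 1000` is a placeholder, not a measured result, and the right value has to come from timing both paths on your own compiler, flags, and hardware.

```fortran
module dot_dispatch
  implicit none
  private
  public :: dp, dot_auto

  integer, parameter :: dp = kind(1.0d0)
  ! Placeholder cut-off (assumption): replace with a value you measured.
  integer, parameter :: n_cutoff = 1000

contains

  ! Dot product that picks the intrinsic for short vectors and
  ! the BLAS routine DDOT for long ones. Assumes size(x) == size(y).
  function dot_auto(x, y) result(d)
    real(dp), intent(in), contiguous :: x(:), y(:)
    real(dp) :: d
    real(dp), external :: ddot   ! supplied by the linked BLAS library

    if (size(x) < n_cutoff) then
      d = dot_product(x, y)            ! compiler-generated code
    else
      d = ddot(size(x), x, 1, y, 1)    ! BLAS Level 1, unit strides
    end if
  end function dot_auto

end module dot_dispatch
```

A caller would `use dot_dispatch` and write `s = dot_auto(a, b)`. The program has to be linked against a BLAS library (MKL or another implementation) for `ddot` to resolve. The `contiguous` attribute is a Fortran 2008 feature.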