How many clock cycles does AVX/SSE exponentiation cost on a modern x86_64 CPU?

I mean pow(x, y) = exp(y*log(x)). That is, do the AVX exp() and log() operations on x86_64 each take a known, fixed number of cycles?

exp(): _mm256_exp_ps()
log(): _mm256_log_ps()

Or can the number of cycles vary with the exponent, and if so, is there a maximum number of cycles an exponentiation can cost?



1 Answer

The x86 SIMD instruction set (i.e. not x87), at least up to AVX2, has no SIMD exp, log, or pow instructions. The one exception is pow(x,0.5), which is the square root.
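To make the square-root exception concrete, here is a minimal sketch using the SSE packed square root. The helper name sqrt4 is made up for this example, and it does not try to match pow(x, 0.5) on special values such as negative zero or infinities.

```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_loadu_ps, _mm_sqrt_ps, _mm_storeu_ps */

/* Hypothetical helper for illustration: out[i] = sqrt(in[i]) for four floats.
   The square root itself is a single packed instruction (SQRTPS). */
static void sqrt4(const float in[4], float out[4])
{
    __m128 v = _mm_loadu_ps(in);         /* load four (possibly unaligned) floats */
    _mm_storeu_ps(out, _mm_sqrt_ps(v));  /* packed square root of all four lanes */
}
```

Because this is a real instruction, its latency and throughput can be looked up per microarchitecture in instruction tables, which is not the case for exp, log, or general pow.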

There are, however, SIMD math libraries that implement these functions (among others) on top of SIMD instructions. Intel's SVML includes:

__m256 _mm256_exp_ps(__m256)
__m256 _mm256_log_ps(__m256)
__m256 _mm256_pow_ps(__m256, __m256)

Intel disingenuously calls these intrinsics, when they are in fact functions made up of several instructions. SVML is closed source and expensive. However, after installing the Intel OpenCL runtime I searched for svml and found some SVML files in the OpenCL directories, so I think you can get SVML indirectly through Intel's OpenCL runtime.

AMD also provides a SIMD math library called LibM. It is closed source but free, and it has its own SIMD math functions:

__m128 amd_vrs4_expf(__m128)
__m128 amd_vrs4_logf(__m128)
__m128 amd_vrs4_powf(__m128, __m128)

Agner Fog's Vector Class Library provides an interface to SVML and LibM. See the file vectormath_lib.h. From this you can figure out the corresponding functions from SVML and LibM.

Agner also provides his own code for these functions, which he claims is competitive with the proprietary Intel and AMD versions. For Agner's versions, look in vectormath_exp.h (e.g. exp_f, log_f, and pow_template_f) and then look at the generated assembly.
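To give a rough idea of what such a function looks like when built from plain SIMD instructions, here is a minimal sketch of a 4-wide single-precision exp using only SSE2. This is not Agner's code and not what SVML or LibM do internally. The name exp4_sketch, the plain Taylor polynomial, and the clamping range are my own assumptions for illustration, and NaN inputs are not handled.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (also pulls in the SSE ones) */

/* Illustrative sketch only: exp(x) = 2^n * exp(r), with n = round(x / ln 2)
   and r = x - n * ln 2, so that |r| <= ln(2)/2. */
static __m128 exp4_sketch(__m128 x)
{
    const __m128 log2e  = _mm_set1_ps(1.44269504f);      /* 1 / ln 2 */
    const __m128 ln2_hi = _mm_set1_ps(0.693359375f);     /* ln 2 split in two */
    const __m128 ln2_lo = _mm_set1_ps(-2.12194440e-4f);  /* parts for accuracy */

    /* Clamp so that 2^n stays a normal float (n in [-126, 127]). */
    x = _mm_min_ps(_mm_max_ps(x, _mm_set1_ps(-87.0f)), _mm_set1_ps(88.0f));

    /* n = round(x / ln 2), using the default round-to-nearest mode. */
    __m128i n  = _mm_cvtps_epi32(_mm_mul_ps(x, log2e));
    __m128  nf = _mm_cvtepi32_ps(n);

    /* r = x - n * ln 2, subtracted in two steps to limit rounding error. */
    __m128 r = _mm_sub_ps(x, _mm_mul_ps(nf, ln2_hi));
    r = _mm_sub_ps(r, _mm_mul_ps(nf, ln2_lo));

    /* exp(r) by a degree-6 Taylor polynomial, evaluated with Horner's rule. */
    __m128 p = _mm_set1_ps(1.0f / 720.0f);
    p = _mm_add_ps(_mm_mul_ps(p, r), _mm_set1_ps(1.0f / 120.0f));
    p = _mm_add_ps(_mm_mul_ps(p, r), _mm_set1_ps(1.0f / 24.0f));
    p = _mm_add_ps(_mm_mul_ps(p, r), _mm_set1_ps(1.0f / 6.0f));
    p = _mm_add_ps(_mm_mul_ps(p, r), _mm_set1_ps(0.5f));
    p = _mm_add_ps(_mm_mul_ps(p, r), _mm_set1_ps(1.0f));
    p = _mm_add_ps(_mm_mul_ps(p, r), _mm_set1_ps(1.0f));

    /* Build 2^n by writing n + 127 into the float exponent field. */
    __m128i bits = _mm_slli_epi32(_mm_add_epi32(n, _mm_set1_epi32(127)), 23);
    return _mm_mul_ps(p, _mm_castsi128_ps(bits));
}
```

Note that this sketch has no branches, so it executes the same instruction sequence whatever the argument is. A real library may take separate paths for special cases, so I would not assume the same holds for SVML or LibM without measuring.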

You can use SVML, LibM, and Agner's own functions to time the exp and log functions. However, you should know that SVML and LibM don't play well on each other's hardware. AMD's library, for example, is optimized for FMA4, which Intel does not have (Intel originally planned to have FMA4 and then suddenly changed to FMA3 after AMD had already planned for FMA4). Intel appears to do something ummm...well I suggest you read about it.

So if you time SVML on AMD processors or LibM on Intel processors, you will likely get very different performance results (unless you manage to replace Intel's CPU dispatch function). Unlike GPUs, the x86 instruction set is publicly available, so you can build your own exp and log functions, and that is what Agner has done.
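If you want to do the timing yourself, a minimal sketch could look like the following. The names ns_per_call and stand_in are hypothetical, stand_in is only a placeholder for whichever exp, log, or pow implementation you want to measure, and clock() reports CPU time rather than cycles, so converting to cycles means multiplying by your core clock frequency (an approximation at best with turbo and frequency scaling).

```c
#include <time.h>  /* clock, clock_t, CLOCKS_PER_SEC */

/* Placeholder workload: replace with the function under test. */
static float stand_in(float x)
{
    return x * x + 1.0f;
}

/* Hypothetical helper: average nanoseconds of CPU time per call of f(x0),
   measured over iters calls (iters must be positive). */
static double ns_per_call(float (*f)(float), float x0, int iters)
{
    volatile float input = x0;   /* re-read every iteration, so the call cannot be hoisted */
    volatile float sink  = 0.0f; /* keeps the result from being optimized away */

    clock_t t0 = clock();
    for (int i = 0; i < iters; ++i) {
        sink = f(input);
    }
    clock_t t1 = clock();

    (void)sink;
    return 1e9 * (double)(t1 - t0) / (double)CLOCKS_PER_SEC / (double)iters;
}
```

Calling it with several different x0 values, e.g. ns_per_call(stand_in, 0.5f, 10000000) and ns_per_call(stand_in, 80.0f, 10000000), is one way to check whether the cost of a given implementation depends on the argument.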

Update

Glibc 2.22 (which should come out soon) has a vector math library called libmvec. Apparently it's enabled starting at -O1 along with -ffast-math and -fopenmp. I'm not sure why fast-math and OpenMP are necessary (particularly in the example below as associative math is not necessary) but it's great to finally have a SIMD math library in the GNU C standard library.

//gcc ./cos.c -O1 -fopenmp -ffast-math -lm -mavx2 
#include <math.h>

int N = 3200;
double b[3200];
double a[3200];

int main (void)
{
  int i;

  #pragma omp simd
  for (i = 0; i < N; i += 1)
  {
    b[i] = cos (a[i]);
  }

  return (0);
}

answered Nov 14 '22 23:11

Z boson


Tags:

c++

x86

x86-64

avx

sse

Alex
