
How many clock cycles does AVX/SSE exponentiation cost on a modern x86_64 CPU?

Tags: c++, x86, x86-64, avx, sse


I mean: pow(x, y) = exp(y*log(x))

That is, do the AVX exp() and log() operations on x86_64 each take a fixed, known number of cycles?

  • exp(): _mm256_exp_ps()
  • log(): _mm256_log_ps()

Or can the number of cycles vary with the exponent, and if so, is there a maximum number of cycles that exponentiation can cost?

asked Jul 19 '15 14:07 by Alex




1 Answer

The x86 SIMD instruction set (i.e. not x87), at least up to AVX2, does not include SIMD exp, log, or pow with the exception of pow(x,0.5) which is the square root.
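To make the square-root exception concrete, here is a minimal sketch using the SSE intrinsic _mm_sqrt_ps, which compiles to a single SQRTPS instruction. The function name sqrt_ps_array and the multiple-of-4 length restriction are my own assumptions for illustration, not part of any library.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: _mm_loadu_ps, _mm_sqrt_ps, _mm_storeu_ps */

/* Computes out[i] = sqrt(in[i]), i.e. pow(in[i], 0.5), four floats at a time.
   Assumption for this sketch: n is a multiple of 4 (no scalar tail loop). */
void sqrt_ps_array(const float *in, float *out, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_loadu_ps(in + i);         /* unaligned load of 4 floats */
        _mm_storeu_ps(out + i, _mm_sqrt_ps(v));  /* one SQRTPS per 4 floats */
    }
}
```

With AVX enabled (e.g. -mavx) the 8-wide counterpart is _mm256_sqrt_ps from immintrin.h. Because the square root is a real instruction, its latency and throughput are listed in per-microarchitecture instruction tables, which is not the case for exp, log, or pow.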

There are, however, SIMD math libraries built from SIMD instructions that provide these functions (among others). Intel's SVML includes:

__m256 _mm256_exp_ps(__m256)
__m256 _mm256_log_ps(__m256)
__m256 _mm256_pow_ps(__m256, __m256)

which Intel disingenuously calls intrinsics when they are in fact functions with several instructions. SVML is closed source and expensive. However, by searching for svml after installing the Intel OpenCL runtime I found some svml files in the OpenCL directories so I think you can get SVML indirectly through Intel's OpenCL runtime.

AMD also provides a SIMD math library called LibM, which is closed source but free, and which has its own SIMD math functions:

__m128 amd_vrs4_expf(__m128)
__m128 amd_vrs4_logf(__m128)
__m128 amd_vrs4_powf(__m128, __m128)

Agner Fog's Vector Class Library provides an interface to SVML and LibM. See the file vectormath_lib.h. From this you can figure out the corresponding functions from SVML and LibM.

Agner also provides his own code for these functions, which he claims is competitive with the proprietary Intel and AMD versions. For Agner's version of the functions, look in vectormath_exp.h (e.g. at exp_f, log_f, and pow_template_f) and then look at the generated assembly.

You can use SVML, LibM, and Agner's own functions to time the exp and log functions. However, you should know that SVML and LibM don't play well on each other's hardware. AMD's library, for example, is optimized for FMA4, which Intel does not have (Intel originally planned to support FMA4 and then switched to FMA3 suddenly, after AMD had already planned for FMA4). Intel appears to do something, ummm... well, I suggest you read about it.
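A rough way to do the timing mentioned above is to read the time-stamp counter around repeated calls and keep the minimum. This is only a sketch under my own assumptions: array_fn, min_ticks, and scale_by_two are hypothetical names, and scale_by_two is a stand-in for whichever exp or log routine you actually want to measure. Note that __rdtsc() is not a serializing instruction, and on recent CPUs the TSC ticks at a constant reference rate, so the result is in TSC ticks rather than exact core clock cycles.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc() on GCC and Clang */

/* Type of the routine under test: processes n floats from in to out. */
typedef void (*array_fn)(const float *in, float *out, size_t n);

/* Returns the smallest TSC tick count observed over `reps` calls of fn.
   Taking the minimum filters out interrupts and cold-cache runs. */
uint64_t min_ticks(array_fn fn, const float *in, float *out, size_t n, int reps)
{
    assert(reps > 0);
    uint64_t best = UINT64_MAX;
    for (int r = 0; r < reps; r++) {
        uint64_t t0 = __rdtsc();
        fn(in, out, n);
        uint64_t t1 = __rdtsc();
        if (t1 - t0 < best) {
            best = t1 - t0;
        }
    }
    return best;
}

/* Hypothetical stand-in for the math routine being measured. */
void scale_by_two(const float *in, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        out[i] = 2.0f * in[i];
    }
}
```

Dividing the result of min_ticks(scale_by_two, in, out, n, reps) by n gives an approximate cost per element. Run it over inputs from different ranges if you want to see whether the cost depends on the argument.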

So if you time SVML or LibM on AMD or Intel processors respectively, you will likely get very different performance results (unless you manage to replace Intel's CPU dispatch function). Unlike GPUs, the x86 instruction set is publicly available, so you can build your own exp and log functions, and that is what Agner has done.
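To give an idea of what building your own looks like, here is a minimal sketch of a 4-wide single-precision exp using only SSE2 intrinsics. This is not Agner's code and not what SVML or LibM do internally. It is the generic textbook scheme (range reduction plus a polynomial), with plain Taylor coefficients where a real library would likely use tuned ones. The name exp_ps_sketch is hypothetical, and I would expect a relative error somewhere around 1e-5 or better, with no handling of NaN, overflow, or denormal results.

```c
#include <assert.h>
#include <stddef.h>
#include <emmintrin.h>  /* SSE2 intrinsics (also pulls in the SSE ones) */

/* Approximates exp(x) for 4 floats at once. Illustrative only: inputs are
   clamped to [-87, 88]; NaN, overflow, and denormals are not handled. */
__m128 exp_ps_sketch(__m128 x)
{
    const __m128 log2e  = _mm_set1_ps(1.44269504f);      /* 1/ln(2) */
    const __m128 ln2_hi = _mm_set1_ps(0.693359375f);     /* ln(2), high part */
    const __m128 ln2_lo = _mm_set1_ps(-2.12194440e-4f);  /* ln(2), low part */

    x = _mm_min_ps(_mm_max_ps(x, _mm_set1_ps(-87.0f)), _mm_set1_ps(88.0f));

    /* Range reduction: x = n*ln(2) + r with |r| <= ln(2)/2. */
    __m128i n  = _mm_cvtps_epi32(_mm_mul_ps(x, log2e));  /* round to nearest */
    __m128  nf = _mm_cvtepi32_ps(n);
    __m128  r  = _mm_sub_ps(x, _mm_mul_ps(nf, ln2_hi));
    r = _mm_sub_ps(r, _mm_mul_ps(nf, ln2_lo));

    /* exp(r) from a degree-5 Taylor polynomial, evaluated in Horner form. */
    __m128 p = _mm_set1_ps(1.0f / 120.0f);
    p = _mm_add_ps(_mm_mul_ps(p, r), _mm_set1_ps(1.0f / 24.0f));
    p = _mm_add_ps(_mm_mul_ps(p, r), _mm_set1_ps(1.0f / 6.0f));
    p = _mm_add_ps(_mm_mul_ps(p, r), _mm_set1_ps(0.5f));
    p = _mm_add_ps(_mm_mul_ps(p, r), _mm_set1_ps(1.0f));
    p = _mm_add_ps(_mm_mul_ps(p, r), _mm_set1_ps(1.0f));

    /* Multiply by 2^n by building a float with biased exponent n + 127. */
    __m128i pow2n = _mm_slli_epi32(_mm_add_epi32(n, _mm_set1_epi32(127)), 23);
    return _mm_mul_ps(p, _mm_castsi128_ps(pow2n));
}
```

The sketch is branch-free, so it executes the same instruction sequence for every input and its cost does not depend on the argument. Whether that holds for a production library depends on how it treats special cases, so it is worth checking the generated assembly or timing it.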


Update

Glibc 2.22 (which should come out soon) has a vector math library called libmvec. Apparently it's enabled starting at -O1 along with -ffast-math and -fopenmp. I'm not sure why fast-math and OpenMP are necessary (particularly in the example below as associative math is not necessary) but it's great to finally have a SIMD math library in the GNU C standard library.

//gcc ./cos.c -O1 -fopenmp -ffast-math -lm -mavx2 
#include <math.h>

int N = 3200;
double b[3200];
double a[3200];

int main (void)
{
  int i;

  #pragma omp simd
  for (i = 0; i < N; i += 1)
  {
    b[i] = cos (a[i]);
  }

  return (0);
}
answered Nov 14 '22 23:11 by Z boson