I am looking for a fast, low-precision (~1e-3) SSE exponential function.
I came across this great answer:
/* max. rel. error = 3.55959567e-2 on [-87.33654, 88.72283] */
__m128 FastExpSse (__m128 x)
{
    __m128 a = _mm_set1_ps (12102203.0f); /* (1 << 23) / log(2) */
    __m128i b = _mm_set1_epi32 (127 * (1 << 23) - 298765);
    __m128i t = _mm_add_epi32 (_mm_cvtps_epi32 (_mm_mul_ps (a, x)), b);
    return _mm_castsi128_ps (t);
}
Based on the work of Nicol N. Schraudolph: N. N. Schraudolph, "A fast, compact approximation of the exponential function," Neural Computation, 11(4), May 1999, pp. 853-862.
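As I understand the trick, it exploits the IEEE-754 layout: exp(x) = 2^(x/ln 2), so scaling x/ln 2 by 2^23 and adding the bias 127 * 2^23 places the integer part of the exponent directly in a float's exponent field, while the fractional part spills into the mantissa and linearly interpolates 2^f ≈ 1 + f (the -298765 term shifts that interpolation to balance the error). A scalar paraphrase of the same idea (my own sketch, not code from the answer):

#include <cmath>
#include <cstdint>
#include <cstring>

// Scalar illustration of the Schraudolph trick used by FastExpSse above.
float fast_exp_scalar(float x)
{
    // exp(x) = 2^(x / ln 2): build the float bit pattern directly,
    // exponent field <- integer part of x/ln 2 + 127, mantissa <- fraction.
    int32_t i = (int32_t)std::lrintf(12102203.0f * x) + (127 * (1 << 23) - 298765);
    float f;
    std::memcpy(&f, &i, sizeof f); // reinterpret the integer bits as a float
    return f;
}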
Now I would need a "double precision" version: __m128d FastExpSseDouble (__m128d x).
This is because I don't control the input and output precision, which happen to be double precision, and the two conversions double -> float, then float -> double, are eating about 50% of the CPU time.
What changes would be needed?
I naively tried this:
__m128i double_to_uint64(__m128d x) {
    // Adding 2^52 forces the integer value into the mantissa bits;
    // XOR with the bit pattern of 2^52 then clears the exponent field.
    x = _mm_add_pd(x, _mm_set1_pd(0x0010000000000000));
    return _mm_xor_si128(
        _mm_castpd_si128(x),
        _mm_castpd_si128(_mm_set1_pd(0x0010000000000000))
    );
}
__m128d FastExpSseDouble(__m128d x) {
#define S 52
#define C (1llu << S) / log(2)
    __m128d a = _mm_set1_pd(C); /* (1 << 52) / log(2) */
    __m128i b = _mm_set1_epi64x(127 * (1llu << S) - 298765llu << 29);
    auto y = double_to_uint64(_mm_mul_pd(a, x));
    __m128i t = _mm_add_epi64(y, b);
    return _mm_castsi128_pd(t);
}
Of course this returns garbage as I don't know what I'm doing...
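For what it's worth, the paper itself actually targets doubles: it computes only the upper 32 bits (sign, exponent, and top mantissa bits) and leaves the lower 32 bits at zero. A scalar transcription of that variant, assuming I am reading the paper's constants correctly (2^20/ln 2 ≈ 1512775.395, 1072693248 = 1023 * 2^20, and the 60801 error tweak):

#include <cstdint>
#include <cstring>

// Scalar sketch of Schraudolph's double-precision variant; only valid while
// 1512775 * y fits in an int32, i.e. roughly |y| < 700. Accuracy is the
// same few-percent ballpark as the float version.
double SchraudolphExpDouble(double y)
{
    int32_t hi = (int32_t)(1512775.3951951856 * y     /* 2^20 / ln(2) */
                           + (1072693248.0 - 60801.0)); /* 1023 * 2^20 - tweak */
    int64_t bits = (int64_t)hi << 32;  // low 32 bits stay zero
    double d;
    std::memcpy(&d, &bits, sizeof d);
    return d;
}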
About the 50% figure: it is a very rough estimate, obtained by comparing the speedup (with respect to std::exp) when processing a vector of single-precision numbers (great) against the speedup when processing a vector of double-precision numbers (not so great).
Here is the code I used:
// gives the result in place
void FastExpSseVector(std::vector<double> & v) { // vector with several million elements
    const size_t I = v.size();
    const size_t N = (I / 4) * 4;
    for (size_t n = 0; n < N; n += 4) {
        // narrow to single precision (this round-trip is the expensive part);
        // alignas(16) so _mm_load_ps/_mm_store_ps are legal
        alignas(16) float a[4] = { float(v[n]), float(v[n + 1]), float(v[n + 2]), float(v[n + 3]) };
        __m128 x = _mm_load_ps(a);
        auto r = FastExpSse(x);
        _mm_store_ps(a, r);
        // widen the results back to double precision
        v[n] = a[0];
        v[n + 1] = a[1];
        v[n + 2] = a[2];
        v[n + 3] = a[3];
    }
    for (size_t n = N; n < I; ++n) {
        v[n] = FastExp(v[n]); // scalar fallback for the tail
    }
}
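(As an aside, I suppose the scalar casts in that loop could themselves be vectorized with _mm_cvtpd_ps / _mm_cvtps_pd, which convert two doubles at a time in registers. A sketch of what I mean, untested and with a name of my own:)

// Hypothetical variant: keep FastExpSse but do the double<->float
// conversions with SIMD instead of scalar casts.
void FastExpSseVectorCvt(std::vector<double> & v) {
    const size_t I = v.size();
    const size_t N = (I / 4) * 4;
    for (size_t n = 0; n < N; n += 4) {
        // narrow two pairs of doubles into one register of four floats
        __m128 lo = _mm_cvtpd_ps(_mm_loadu_pd(&v[n]));
        __m128 hi = _mm_cvtpd_ps(_mm_loadu_pd(&v[n + 2]));
        __m128 r = FastExpSse(_mm_movelh_ps(lo, hi));
        // widen the four results back to doubles and store
        _mm_storeu_pd(&v[n],     _mm_cvtps_pd(r));
        _mm_storeu_pd(&v[n + 2], _mm_cvtps_pd(_mm_movehl_ps(r, r)));
    }
    for (size_t n = N; n < I; ++n) {
        v[n] = FastExp(v[n]); // same scalar fallback as above
    }
}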
And here is what I would do if I had this "double precision" version:
void FastExpSseVectorDouble(std::vector<double> & v) {
    const size_t I = v.size();
    const size_t N = (I / 2) * 2;
    for (size_t n = 0; n < N; n += 2) {
        __m128d x = _mm_loadu_pd(&v[n]); // unaligned load: v's data is not guaranteed 16-byte aligned
        auto r = FastExpSseDouble(x);
        _mm_storeu_pd(&v[n], r);
    }
    for (size_t n = N; n < I; ++n) {
        v[n] = FastExp(v[n]); // scalar fallback for the tail
    }
}
Something like this should do the job. You need to tune the 1.05 constant to get a lower maximal error -- I'm too lazy to do that:
__m128d fastexp(const __m128d &x)
{
    // After adding ~3*1024 the value lies in [2048, 4096), so shifting its
    // bit pattern left by 11 moves the (biased) integer part of x/log(2)
    // into the double's exponent field; the fraction becomes the mantissa.
    __m128d scaled = _mm_add_pd(_mm_mul_pd(x, _mm_set1_pd(1.0 / std::log(2.0))),
                                _mm_set1_pd(3 * 1024.0 - 1.05));
    return _mm_castsi128_pd(_mm_slli_epi64(_mm_castpd_si128(scaled), 11));
}
This just gets about 2.5% relative precision -- for better precision you may need to add a second term.
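If you do want to tune that 1.05 constant (or check the error after adding a term), a throwaway harness along these lines works; this is my addition, not part of the method itself:

#include <cmath>
#include <emmintrin.h>

// Max relative error of the trick on [-10, 10] for a given offset shift
// (the code above uses shift = 1.05). Brute force, for tuning only.
static double max_rel_error(double shift) {
    double worst = 0.0;
    for (double x = -10.0; x <= 10.0; x += 1e-4) {
        __m128d scaled = _mm_add_pd(
            _mm_mul_pd(_mm_set1_pd(x), _mm_set1_pd(1.0 / std::log(2.0))),
            _mm_set1_pd(3 * 1024.0 - shift));
        double approx = _mm_cvtsd_f64(
            _mm_castsi128_pd(_mm_slli_epi64(_mm_castpd_si128(scaled), 11)));
        double err = std::fabs(approx / std::exp(x) - 1.0);
        if (err > worst) worst = err;
    }
    return worst;
}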
Also, for values which overflow or underflow this will produce unspecified results; you can avoid that by clamping the scaled value to a safe range.
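A sketch of that clamping (my addition; the bounds follow from the bit layout: the shift only yields a valid exponent while scaled stays in [2048, 4096), and exponent fields 0 and 2047 encode zero and infinity):

__m128d fastexp_clamped(const __m128d &x)
{
    __m128d scaled = _mm_add_pd(_mm_mul_pd(x, _mm_set1_pd(1.0 / std::log(2.0))),
                                _mm_set1_pd(3 * 1024.0 - 1.05));
    // Clamp to the window where the shifted bits form a valid exponent:
    // 2048.0 maps underflow to +0.0, 4095.0 maps overflow to +inf.
    scaled = _mm_max_pd(scaled, _mm_set1_pd(2048.0));
    scaled = _mm_min_pd(scaled, _mm_set1_pd(4095.0));
    return _mm_castsi128_pd(_mm_slli_epi64(_mm_castpd_si128(scaled), 11));
}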