In numerical computation, it is often needed to scale numbers to be in safe range.
For example, computing Euclidean distance: sqrt(a^2+b^2)
. Here, if magnitude of a
or b
is too small/large, then underflow/overflow can happen.
A common approach to solve this is to divide numbers by the largest magnitude number. However, this solution is:
So I thought that instead of dividing by the largest magnitude number, let's multiply it with a close power-of-2 reciprocal number. This seems a better solution, as:
So, I'd like to create a small utility function, which has a logic like this (by ^
, I mean exponentiation):
void getScaler(double value, double &scaler, double &scalerReciprocal) {
int e = <exponent of value>;
if (e<-1022) { scaler=2^-1022; scalerReciprocal = 2^1022; }
} else if (e>1022) { scaler=2^1022; scalerReciprocal = 2^-1022; }
} else { scaler=2^e; scalerReciprocal = 2^(2046-e); }
}
This function should return a normalized scaler
& scalerReciprocal
, both are power-of-2 numbers, where scaler
is near to value
, and scalerReciprocal
is the reciprocal of scaler
.
The maximum allowed exponents for scaler
/scaleReciprocal
are -1022..1022
(I don't want to work with subnormal scaler
, as subnormal numbers can be slow).
What would be a fast way to do this? Can this be done with pure floating-point operations? Or should I extract the exponent from value
, and use simple if
s to do the logic? Is there some kind of trick to do the comparison with (-)1022 fast (as the range is symmetric)?
Note: scaler
doesn't need to be the closest possible power-of-2. If some logic needs it, scaler
can be some small power-of-2 away from the closest value.
Finding previous power of 2 If we somehow, can unset all bits except the left most bit in a binary representation of number we will get previous power of two of the given number. We will use Bitwise AND ( & ) operation to clear bits.
To check if a given number is a power of 2, we can continuously divide the number by 2, on the condition that the given number is even. After the last possible division, if the value of the number is equal to 1, it is a power of 2. Otherwise, it is not.
Floating-point decimal values generally do not have an exact binary representation. This is a side effect of how the CPU represents floating point data. For this reason, you may experience some loss of precision, and some floating-point operations may produce unexpected results.
Function s = get_scale(z)
computes the "close power of 2". Since the fraction bits of s
are zero, the inverse of s
is just an (inexpensive) integer subtraction: see function inv_of_scale
.
On x86 get_scale
and inv_of_scale
compile to quite efficient assembly with clang.
Compiler clang translates the ternary operators to minsd
and maxsd
,
see also Peter Cordes' comment.
With gcc, it is slightly more efficient to
translate these functions to x86 intrinsics
code (get_scale_x86
and inv_of_scale_x86
), see Godbolt.
Note that C explicitly permits type-punning
through a union, whereas C++ (c++11) has no such permission
Although gcc 8.2 and clang 7.0 do not complain about the union, you can improve
the C++ portabily by using the memcpy
trick instead of the
union trick. Such a modification of the code should be trivial.
The code should handle subnormals correctly.
#include<stdio.h>
#include<stdint.h>
#include<immintrin.h>
/* gcc -Wall -m64 -O3 -march=sandybridge dbl_scale.c */
union dbl_int64{
double d;
uint64_t i;
};
double get_scale(double t){
union dbl_int64 x;
union dbl_int64 x_min;
union dbl_int64 x_max;
uint64_t mask_i;
/* 0xFEDCBA9876543210 */
x_min.i = 0x0010000000000000ull;
x_max.i = 0x7FD0000000000000ull;
mask_i = 0x7FF0000000000000ull;
x.d = t;
x.i = x.i & mask_i; /* Set fraction bits to zero, take absolute value */
x.d = (x.d < x_min.d) ? x_min.d : x.d; /* If subnormal: set exponent to 1 */
x.d = (x.d > x_max.d) ? x_max.d : x.d; /* If exponent is very large: set exponent to 7FD, otherwise the inverse is a subnormal */
return x.d;
}
double get_scale_x86(double t){
__m128d x = _mm_set_sd(t);
__m128d x_min = _mm_castsi128_pd(_mm_set1_epi64x(0x0010000000000000ull));
__m128d x_max = _mm_castsi128_pd(_mm_set1_epi64x(0x7FD0000000000000ull));
__m128d mask = _mm_castsi128_pd(_mm_set1_epi64x(0x7FF0000000000000ull));
x = _mm_and_pd(x, mask);
x = _mm_max_sd(x, x_min);
x = _mm_min_sd(x, x_max);
return _mm_cvtsd_f64(x);
}
/* Compute the inverse 1/t of a double t with all zero fraction bits */
/* and exponent between the limits of function get_scale */
/* A single integer subtraction is much less expensive than a */
/* floating point division. */
double inv_of_scale(double t){
union dbl_int64 x;
/* 0xFEDCBA9876543210 */
uint64_t inv_mask = 0x7FE0000000000000ull;
x.d = t;
x.i = inv_mask - x.i;
return x.d;
}
double inv_of_scale_x86(double t){
__m128i inv_mask = _mm_set1_epi64x(0x7FE0000000000000ull);
__m128d x = _mm_set_sd(t);
__m128i x_i = _mm_sub_epi64(inv_mask, _mm_castpd_si128(x));
return _mm_cvtsd_f64(_mm_castsi128_pd(x_i));
}
int main(){
int n = 14;
int i;
/* Several example values, 4.94e-324 is the smallest subnormal */
double y[14] = { 4.94e-324, 1.1e-320, 1.1e-300, 1.1e-5, 0.7, 1.7, 123.1, 1.1e300,
1.79e308, -1.1e-320, -0.7, -1.7, -123.1, -1.1e307};
double z, s, u;
printf("Portable code:\n");
printf(" x pow_of_2 inverse pow2*inv x*inverse \n");
for (i = 0; i < n; i++){
z = y[i];
s = get_scale(z);
u = inv_of_scale(s);
printf("%14e %14e %14e %14e %14e\n", z, s, u, s*u, z*u);
}
printf("\nx86 specific SSE code:\n");
printf(" x pow_of_2 inverse pow2*inv x*inverse \n");
for (i = 0; i < n; i++){
z = y[i];
s = get_scale_x86(z);
u = inv_of_scale_x86(s);
printf("%14e %14e %14e %14e %14e\n", z, s, u, s*u, z*u);
}
return 0;
}
The output looks fine:
Portable code:
x pow_of_2 inverse pow2*inv x*inverse
4.940656e-324 2.225074e-308 4.494233e+307 1.000000e+00 2.220446e-16
1.099790e-320 2.225074e-308 4.494233e+307 1.000000e+00 4.942713e-13
1.100000e-300 7.466109e-301 1.339386e+300 1.000000e+00 1.473324e+00
1.100000e-05 7.629395e-06 1.310720e+05 1.000000e+00 1.441792e+00
7.000000e-01 5.000000e-01 2.000000e+00 1.000000e+00 1.400000e+00
1.700000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.700000e+00
1.231000e+02 6.400000e+01 1.562500e-02 1.000000e+00 1.923437e+00
1.100000e+300 6.696929e+299 1.493222e-300 1.000000e+00 1.642544e+00
1.790000e+308 4.494233e+307 2.225074e-308 1.000000e+00 3.982882e+00
-1.099790e-320 2.225074e-308 4.494233e+307 1.000000e+00 -4.942713e-13
-7.000000e-01 5.000000e-01 2.000000e+00 1.000000e+00 -1.400000e+00
-1.700000e+00 1.000000e+00 1.000000e+00 1.000000e+00 -1.700000e+00
-1.231000e+02 6.400000e+01 1.562500e-02 1.000000e+00 -1.923437e+00
-1.100000e+307 5.617791e+306 1.780059e-307 1.000000e+00 -1.958065e+00
x86 specific SSE code:
x pow_of_2 inverse pow2*inv x*inverse
4.940656e-324 2.225074e-308 4.494233e+307 1.000000e+00 2.220446e-16
1.099790e-320 2.225074e-308 4.494233e+307 1.000000e+00 4.942713e-13
1.100000e-300 7.466109e-301 1.339386e+300 1.000000e+00 1.473324e+00
1.100000e-05 7.629395e-06 1.310720e+05 1.000000e+00 1.441792e+00
7.000000e-01 5.000000e-01 2.000000e+00 1.000000e+00 1.400000e+00
1.700000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.700000e+00
1.231000e+02 6.400000e+01 1.562500e-02 1.000000e+00 1.923437e+00
1.100000e+300 6.696929e+299 1.493222e-300 1.000000e+00 1.642544e+00
1.790000e+308 4.494233e+307 2.225074e-308 1.000000e+00 3.982882e+00
-1.099790e-320 2.225074e-308 4.494233e+307 1.000000e+00 -4.942713e-13
-7.000000e-01 5.000000e-01 2.000000e+00 1.000000e+00 -1.400000e+00
-1.700000e+00 1.000000e+00 1.000000e+00 1.000000e+00 -1.700000e+00
-1.231000e+02 6.400000e+01 1.562500e-02 1.000000e+00 -1.923437e+00
-1.100000e+307 5.617791e+306 1.780059e-307 1.000000e+00 -1.958065e+00
Vectorization
Function get_scale
should vectorize with compilers that support auto-vectorization. The following piece of
code vectorizes very well with clang (no need to write SSE/AVX intrinsics code).
/* Test how well get_scale vectorizes: */
void get_scale_vec(double * __restrict__ t, double * __restrict__ x){
int n = 1024;
int i;
for (i = 0; i < n; i++){
x[i] = get_scale(t[i]);
}
}
Unfortunately gcc doesn't find the vmaxpd
and vminpd
instructions.
Based on wim's answer, here's another solution, which can be faster, as it has one less instruction. The output is a little bit different, but still fulfills the requirements.
The idea is to use bit operations to fix border cases: put a 01
to the lsb of the exponent, no matter of its value. So, exponent:
00
to 01
), or halved (when 10->01) or 1/4 (when 11->01)So, this modified routine works (and I think that it's pretty cool that the problem can be solved with only 2 fast asm instructions):
#include<stdio.h>
#include<stdint.h>
#include<immintrin.h>
/* gcc -Wall -m64 -O3 -march=sandybridge dbl_scale.c */
union dbl_int64{
double d;
uint64_t i;
};
double get_scale(double t){
union dbl_int64 x;
uint64_t and_i;
uint64_t or_i;
/* 0xFEDCBA9876543210 */
and_i = 0x7FD0000000000000ull;
or_i = 0x0010000000000000ull;
x.d = t;
x.i = (x.i & and_i)|or_i; /* Set fraction bits to zero, take absolute value */
return x.d;
}
double get_scale_x86(double t){
__m128d x = _mm_set_sd(t);
__m128d x_and = _mm_castsi128_pd(_mm_set1_epi64x(0x7FD0000000000000ull));
__m128d x_or = _mm_castsi128_pd(_mm_set1_epi64x(0x0010000000000000ull));
x = _mm_and_pd(x, x_and);
x = _mm_or_pd(x, x_or);
return _mm_cvtsd_f64(x);
}
/* Compute the inverse 1/t of a double t with all zero fraction bits */
/* and exponent between the limits of function get_scale */
/* A single integer subtraction is much less expensive than a */
/* floating point division. */
double inv_of_scale(double t){
union dbl_int64 x;
/* 0xFEDCBA9876543210 */
uint64_t inv_mask = 0x7FE0000000000000ull;
x.d = t;
x.i = inv_mask - x.i;
return x.d;
}
double inv_of_scale_x86(double t){
__m128i inv_mask = _mm_set1_epi64x(0x7FE0000000000000ull);
__m128d x = _mm_set_sd(t);
__m128i x_i = _mm_sub_epi64(inv_mask, _mm_castpd_si128(x));
return _mm_cvtsd_f64(_mm_castsi128_pd(x_i));
}
int main(){
int n = 14;
int i;
/* Several example values, 4.94e-324 is the smallest subnormal */
double y[14] = { 4.94e-324, 1.1e-320, 1.1e-300, 1.1e-5, 0.7, 1.7, 123.1, 1.1e300,
1.79e308, -1.1e-320, -0.7, -1.7, -123.1, -1.1e307};
double z, s, u;
printf("Portable code:\n");
printf(" x pow_of_2 inverse pow2*inv x*inverse \n");
for (i = 0; i < n; i++){
z = y[i];
s = get_scale(z);
u = inv_of_scale(s);
printf("%14e %14e %14e %14e %14e\n", z, s, u, s*u, z*u);
}
printf("\nx86 specific SSE code:\n");
printf(" x pow_of_2 inverse pow2*inv x*inverse \n");
for (i = 0; i < n; i++){
z = y[i];
s = get_scale_x86(z);
u = inv_of_scale_x86(s);
printf("%14e %14e %14e %14e %14e\n", z, s, u, s*u, z*u);
}
return 0;
}
You can use
double frexp (double x, int* exp);
Returned value is the fractional part of of x and exp is the exponent (minus the offset).
Alternatively, the following code gets the exponent part of a double.
int get_exp(double *d) {
long long *l = (long long *) d;
return ((*l & (0x7ffLL << 52) )>> 52)-1023 ;
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With