int64_t to double to int64_t again, loss of precision

Question

I need to parse a value of a known type (e.g. a long long integer) that is written in scientific notation. Examples:

123456789012345678.3e-3
123456789012345678.3

I know the type of the given string but I can't use strtoll since number is given in scientific notation. Instead, I convert it with strtod, do error checks with respect to int64_t, and cast the result back to int64_t. ErrCheckInt and ErrCheckFloat perform the error checks (overflow, underflow, etc.) for integral and floating-point types and cast the number to the requested output type.

double res = strtod(processedStr, &end);
return (std::is_floating_point<OUT_T>::value) ? ErrCheckFloat<double, OUT_T>(res, out) : ErrCheckInt<double, OUT_T>(res, out);

The problem is that when I parse an int64_t value through double, I get a floating-point number in normalized scientific notation (one digit before the decimal point). When I cast that number back to int64_t, I lose precision. The example number:

input:             123456789012345678.3
double_converted:  1.23456789012346E+17
cast_to_int64_t:   123456789012345680
expected:          123456789012345678

I know that number is long enough to be represented correctly with double precision. I can use long double but that won't solve the problem.

I could process the string myself and remove or add digits according to the exponent, but the processing has to be very, very fast because the code will run on an embedded RTOS. I already do a lot of checks, and strtod will take its time as well.

Pascal Cuoq · Accepted Answer

I know the type of the given string but I can't use strtoll since number is given in scientific notation.

You only need to call strtoll once, use the resulting end pointer to detect whether the number is in xxxeyyy form, and call strtoll again to parse the exponent. In my opinion this is much simpler than going through floating-point.
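A minimal sketch of that idea, under some assumptions that are not in the original post: the helper name parse_sci_int64 is hypothetical, the result is truncated toward zero, an integer part is required (so ".5e1" is rejected), and any fractional digits are consumed by hand between the two strtoll calls.

```cpp
#include <cctype>
#include <cerrno>
#include <cstdint>
#include <cstdlib>
#include <limits>
#include <optional>

static_assert(sizeof(long long) == sizeof(std::int64_t), "sketch assumes 64-bit long long");

// Hypothetical helper: parses text such as "123456789012345678.3e-3" into an
// int64_t without going through double. Returns std::nullopt on a syntax
// error or on overflow; the value is truncated toward zero.
std::optional<std::int64_t> parse_sci_int64(const char* s) {
    while (std::isspace(static_cast<unsigned char>(*s))) ++s;
    const bool negative = (*s == '-');  // remembered so that "-0.5e1" keeps its sign

    // First strtoll call: the integer part, stopping at '.', 'e' or the end.
    char* end = nullptr;
    errno = 0;
    long long value = std::strtoll(s, &end, 10);
    if (end == s || errno == ERANGE) return std::nullopt;

    // Optional fraction: only remember where its digits start and stop.
    const char* frac = end;
    const char* frac_end = end;
    if (*end == '.') {
        frac = end + 1;
        frac_end = frac;
        while (std::isdigit(static_cast<unsigned char>(*frac_end))) ++frac_end;
    }

    // Optional exponent: second strtoll call.
    long long exp = 0;
    const char* p = frac_end;
    if (*p == 'e' || *p == 'E') {
        char* exp_end = nullptr;
        errno = 0;
        exp = std::strtoll(p + 1, &exp_end, 10);
        if (exp_end == p + 1 || errno == ERANGE) return std::nullopt;
        p = exp_end;
    }
    if (*p != '\0') return std::nullopt;  // trailing characters

    constexpr long long kMax = std::numeric_limits<long long>::max();
    constexpr long long kMin = std::numeric_limits<long long>::min();

    // Positive exponent: shift left one decimal digit at a time, pulling in
    // fraction digits while there are any, and checking for overflow first.
    for (; exp > 0; --exp) {
        if (value == 0 && frac == frac_end) break;  // zero stays zero
        const int digit = (frac != frac_end) ? (*frac++ - '0') : 0;
        if (negative) {
            if (value < (kMin + digit) / 10) return std::nullopt;
            value = value * 10 - digit;
        } else {
            if (value > (kMax - digit) / 10) return std::nullopt;
            value = value * 10 + digit;
        }
    }
    // Negative exponent: drop low-order digits (truncation toward zero).
    for (; exp < 0 && value != 0; ++exp) value /= 10;

    return static_cast<std::int64_t>(value);
}
```

With this sketch, "123456789012345678.3" yields 123456789012345678 and "123456789012345678.3e-3" yields 123456789012345. Only integer arithmetic is involved, which may matter on an embedded target without fast floating-point.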

I know that number is long enough to be represented correctly with double precision.

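No, you don't know that, since your example input is “123456789012345678”, which is not representable in IEEE 754 double-precision. A double has a 53-bit significand, so not every integer above 2^53 is representable; between 2^56 and 2^57, where this number lies, consecutive doubles are 16 apart, and the nearest one is 123456789012345680.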
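The rounding can be checked with a short sketch. It assumes that double is IEEE 754 binary64 and that strtod rounds correctly; the function names via_double and spacing_at are made up for illustration.

```cpp
#include <cmath>
#include <cstdint>
#include <cstdlib>
#include <limits>

// Converts a decimal string through double and back, as the question does.
std::int64_t via_double(const char* s) {
    return static_cast<std::int64_t>(std::strtod(s, nullptr));
}

// Distance from x to the next larger representable double.
double spacing_at(double x) {
    return std::nextafter(x, std::numeric_limits<double>::infinity()) - x;
}
```

Under those assumptions, via_double("123456789012345678.3") gives 123456789012345680, and spacing_at(123456789012345680.0) gives 16, so no double lies closer to the input.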

I can use long double but that won't solve the problem.

Actually, if your compiler maps long double to “80-bit extended precision with 64 bit significand”, it will solve the problem: all 64-bit integers are representable in that format. GCC and Clang make the historical 80-bit floating-point format available through long double on Linux, but it is so inconvenient as to be practically considered not available on Windows: you would need to change the FPU control word, restore it every time you have library functions to call, and write your own math functions to operate on 80-bit floating-point values, starting with strtold.
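Whether long double is wide enough can be tested at compile time. The following is only a sketch: the names long_double_holds_int64 and via_long_double are hypothetical, and it falls back to std::nullopt on platforms where long double has fewer than 64 significand bits.

```cpp
#include <cstdint>
#include <cstdlib>
#include <limits>
#include <optional>

// True when every int64_t value fits exactly in long double's significand.
constexpr bool long_double_holds_int64 =
    std::numeric_limits<long double>::digits >= 64;

// Hypothetical helper: parses through long double, truncating toward zero.
// Returns std::nullopt if long double is too narrow or the value is out of range.
std::optional<std::int64_t> via_long_double(const char* s) {
    if (!long_double_holds_int64) return std::nullopt;
    const long double v = std::strtold(s, nullptr);
    // Accept only -2^63 <= v < 2^63; written this way so that NaN is rejected too.
    if (!(v >= -9223372036854775808.0L && v < 9223372036854775808.0L)) {
        return std::nullopt;
    }
    return static_cast<std::int64_t>(v);
}
```

Where the 80-bit format is available, via_long_double("123456789012345678.3") returns 123456789012345678; elsewhere it returns std::nullopt and an integer-only parse is the safer route.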

Tags: c++, type-conversion, floating-point, floating-accuracy

Halil Kaskavalci
