Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Underflow error in floating point arithmetic in C

I am new to C, and my task is to create a function

f(x) = sqrt[(x^2)+1]-1

that can handle very large numbers and very small numbers. I am submitting my script on an online interface that checks my answers.

For very large numbers I simplify the expression to:

f(x) = x-1

By just using the highest power. This was the correct answer.

The same logic does not work for smaller numbers. For small numbers (on the order of 1e-7), they are very quickly truncated to zero, even before they are squared. I suspect that this has to do with floating point precision in C. In my textbook, it says that the float type has smallest possible value of 1.17549e-38, with 6 digit precision. So although 1e-7 is much larger than 1.17e-38, it has a higher precision, and is therefore rounded to zero. This is my guess, correct me if I'm wrong.

As a solution, I am thinking that I should convert x to a long double when x < 1e-6. However when I do this, I still get the same error. Any ideas? Let me know if I can clarify. Code below:

#include <math.h>
#include <stdio.h>

double feval(double x) {
    /* Insert your code here */
    if (x > 1e299) 
    {;
        return x-1;
    }
    if (x < 1e-6)
    {
        long double g;
        g = x;
        printf("x = %Lf\n", g);
        long double a;
        a = pow(x,2);
        printf("x squared = %Lf\n", a);
        return sqrt(g*g+1.)- 1.;
    }
    else
    { 
        printf("x = %f\n", x);
        printf("Used third \n");
        return sqrt(pow(x,2)+1.)-1;
    }
}

int main(void)
{
    double x;
    printf("Input: ");
    scanf("%lf", &x);
    double b;
    b = feval(x);
    printf("%f\n", b);
    return 0;
}
like image 491
lcleary Avatar asked Dec 22 '22 15:12

lcleary


2 Answers

For small inputs, you're getting truncation error when you do 1+x^2. If x=1e-7f, x*x will happily fit into a 32 bit floating point number (with a little bit of error due to the fact that 1e-7 does not have an exact floating point representation, but x*x will be so much smaller than 1 that floating point precision will not be sufficient to represent 1+x*x.

It would be more appropriate to do a Taylor expansion of sqrt(1+x^2), which to lowest order would be

sqrt(1+x^2) = 1 + 0.5*x^2 + O(x^4)

Then, you could write your result as

sqrt(1+x^2)-1 = 0.5*x^2 + O(x^4),

avoiding the scenario where you add a very small number to 1.

As a side note, you should not use pow for integer powers. For x^2, you should just do x*x. Arbitrary integer powers are a little trickier to do efficiently; the GNU scientific library for example has a function for efficiently computing arbitrary integer powers.

like image 176
johnmastroberti Avatar answered Jan 10 '23 21:01

johnmastroberti


There are two issues here when implementing this in the naive way: Overflow or underflow in intermediate computation when computing x * x, and substractive cancellation during final subtraction of 1. The second issue is an accuracy issue.

ISO C has a standard math function hypot (x, y) that performs the computation sqrt (x * x + y * y) accurately while avoiding underflow and overflow in intermediate computation. A common approach to fix issues with subtractive cancellation is to transform the computation algebraically such that it is transformed into multiplications and / or divisions.

Combining these two fixes leads to the following implementation for float argument. It has an error of less than 3 ulps across all possible inputs according to my testing.

/* Compute sqrt(x*x+1)-1 accurately and without spurious overflow or underflow */
float func (float x)
{
    return (x / (1.0f + hypotf (x, 1.0f))) * x;
}
like image 43
njuffa Avatar answered Jan 10 '23 22:01

njuffa