I have mostly worked with integers, and in situations where I needed to truncate a float or double to an integer, I used to write:
(int) someValue
until I found out the following:
NSLog(@"%i", (int) ((1.2 - 1) * 10)); // prints 1
NSLog(@"%i", (int) ((1.2f - 1) * 10)); // prints 2
(please see "Strange behavior when casting a float to int in C#" for the explanation).
The short question is: how should we truncate a float or double to an integer properly? (Truncation is wanted in this case, not rounding.) One could argue that since one intermediate value is roughly 1.9999999999999 and the other is roughly 2.00000000000001, the truncation is actually done correctly. So the question is: how should we convert a float or double so that the result is a "truncated" number that makes common-usage sense?
(The intention is not to use round, because in this case, for 1.8, we do want a result of 1, not 2.)
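For reference, printing the intermediate values at full precision shows what is actually being truncated (a small diagnostic sketch; the printed digits are what IEEE-754 double and float arithmetic produce here):

NSLog(@"%.17g", (1.2 - 1) * 10);             // 1.9999999999999996 (double arithmetic)
NSLog(@"%.17g", (double) ((1.2f - 1) * 10)); // 2.0000004768371582 (float arithmetic)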
Longer question:
I used
int truncateToInteger(double a) {
    return (int) (a + 0.000000000001);
}

- (void)someTest {
    NSLog(@"%i", truncateToInteger((1.2 - 1) * 10));
    NSLog(@"%i", truncateToInteger((1.2f - 1) * 10));
}
and both print 2, but it seems too much of a hack. What small number should we use to "remove the inaccuracy"? Is there a more standard or studied way to do this, instead of such an arbitrary hack?
(Note that we want truncation, not rounding, in some usages. For example, if the number of elapsed seconds is 90 or 118, then when we display how many minutes and seconds have elapsed, the minutes should display as 1, and should not be rounded up to 2.)
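For instance, a minimal sketch of that elapsed-time display; plain integer arithmetic suffices here, so no floating-point issue arises:

int elapsed = 118;                        // elapsed time in seconds
int minutes = elapsed / 60;               // integer division truncates: 1
int seconds = elapsed % 60;               // remainder: 58
NSLog(@"%i min %i s", minutes, seconds);  // "1 min 58 s", not rounded up to 2 min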
It truncates automatically if you assign the value to an int variable:

int c;
c = a / b;

Or you can cast like this:

c = (int) (a / b);
The truncation has been performed correctly, of course, but on an inaccurate intermediate value.
In general there's no way to know whether your 1.999999 result is a slightly inaccurate 2 (so the exact-maths result after truncation is 2), or a slightly inaccurate 1.999998 (so the exact-maths result after truncation is 1).
For that matter, for some calculations you could get 2.000001 as a slightly inaccurate 1.999998. Pretty much whatever you do, you'll get that one wrong. Truncation is a non-continuous function, so however you do it, it makes your overall computation numerically unstable.
You could add an arbitrary tolerance anyway: (int)(x > 0 ? x + epsilon : x - epsilon). It may or may not help, depending on what you're doing, which is why it's a "hack". epsilon could be a constant, or it could scale according to the size of x.
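For illustration, here is one shape such a hack could take, with epsilon scaled to the magnitude of x (a sketch only; the scale factor 1e-9 is an arbitrary assumption, not a recommendation):

#include <math.h>

// Truncate with an arbitrary tolerance, as described above. The tolerance
// scales with the size of x; whether any particular scale is appropriate
// depends entirely on the computation that produced x.
int truncateWithTolerance(double x) {
    double epsilon = fabs(x) * 1e-9;
    return (int) (x > 0 ? x + epsilon : x - epsilon);
}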
The most common solution to your second question isn't to "remove the inaccuracy", but rather to accept the inaccurate result as if it were accurate. So, if your floating-point unit says that (1.2 - 1) * 10 is 1.999999, OK, it is 1.999999. If that value represents a number of minutes, then it truncates to 1 minute 59 seconds. Your final displayed result will be 1s off the true value. If you need a more accurate final displayed result than that, then you shouldn't have used floating-point arithmetic to compute it, or perhaps you should have rounded to the nearest second before truncating to minutes.
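A sketch of that last suggestion, rounding to the nearest second before truncating to minutes (lround is from <math.h>):

double minutesValue = (1.2 - 1) * 10;             // 1.9999999999999996, intended as 2 minutes
long totalSeconds = lround(minutesValue * 60.0);  // 120: round to the nearest second first
NSLog(@"%ld min %ld s", totalSeconds / 60, totalSeconds % 60);  // "2 min 0 s"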
Any attempt to "remove" inaccuracy from a floating-point number is actually just going to move the inaccuracy around: some inputs will give more accurate results, others less accurate. If you're lucky enough to be in a case where the inaccuracy is shifted to inputs you don't care about, or can filter out before doing the computation, then you win. In general though, if you have to accept any input then you're going to lose somewhere. You need to look at how to make your computation more accurate, rather than trying to remove inaccuracy in a truncation step at the end.
There's a simple correction for your example computation: use fixed-point arithmetic with one base-10 decimal place. We know that format can represent 1.2 accurately. So, instead of writing (1.2 - 1) * 10, you should rescale the computation to use tenths (write (12 - 10) * 10) and then divide the final result by 10 to scale it back to units.
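A sketch of that rescaled computation; all intermediate arithmetic is exact because it is done in integers (assuming the inputs really are exact multiples of one tenth):

int tenths = (12 - 10) * 10;  // 20 tenths, computed exactly in integer arithmetic
NSLog(@"%i", tenths / 10);    // prints 2: scale back from tenths to units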
As you have modified your question, the problem now seems to be this: Given some inputs x, you calculate a value f'(x). f'(x) is the calculated approximation to an exact mathematical function f(x). You want to calculate trunc(f(x)), that is, the integer i that is farthest from zero without being farther from zero than f(x) is. Because f'(x) has some error, trunc(f'(x)) might not equal trunc(f(x)), such as when f(x) is 2 but f'(x) is 0x1.fffffffffffffp0. Given f'(x), how can you calculate trunc(f(x))?
This problem is impossible to solve. There is no solution that will work for all f.
The reason there is no solution is that, due to the error in f', f'(x) might be 0x1.fffffffffffffp0 because f(x) is 0x1.fffffffffffffp0, or f'(x) might be 0x1.fffffffffffffp0 because of calculation errors even though f(x) is 2. Therefore, given a particular value of f'(x), it is impossible to know what trunc(f(x)) is.
A solution is possible only given detailed information about f (and the actual operations used to approximate it with f'). You have not given that information, so your question cannot be answered.
Here is a hypothesis: Suppose the nature of f(x) is such that its results are always a non-negative multiple of q, for some q that divides 1. For example, q might be .01 (hundredths of a coordinate value) or 1/60 (represent units of seconds because f is in units of minutes). And suppose the values and operations used in calculating f' are such that the error in f' is always less than q/2.
In this very limited and hypothetical case, trunc(f(x)) can be calculated by calculating trunc(f'(x)+q/2). Proof: Let i = trunc(f(x)). Suppose i > 0. Then i <= f(x) < i+1, so i <= f(x) <= i+1-q (because f(x) is quantized by q). Then i-q/2 < f'(x) < i+1-q+q/2 (because f'(x) is within q/2 of f(x)). Then i < f'(x)+q/2 < i+1. Then trunc(f'(x)+q/2) = i, so we have the desired result. In the case where i = 0, then -1 < f(x) < 1, so -1+q <= f(x) <= 1-q, so -1+q-q/2 < f'(x) < 1-q+q/2, so -1+q < f'(x)+q/2 < 1, so trunc(f'(x)+q/2) = 0.
(Note: If q/2 is not exactly representable in the floating-point precision used or cannot be easily added to f'(x) without error, then some adjustments have to be made in either the proof, its conditions, or the addition of q/2.)
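Under exactly those hypothetical conditions, the correction might be sketched like this (here q = 1/60, as in the minutes example; the assumption that the error in f' stays below q/2 is doing all the work):

// Correct only when f(x) is a non-negative multiple of q and the computed
// approximation is within q/2 of the true value.
int truncQuantized(double approx, double q) {
    return (int) (approx + q / 2);
}

NSLog(@"%i", truncQuantized(1.9999999, 1.0 / 60.0));    // true value 2 minutes: prints 2
NSLog(@"%i", truncQuantized(119.0 / 60.0, 1.0 / 60.0)); // 1 minute 59 seconds: prints 1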
If that case does not serve your purpose, then you cannot expect an answer except by providing detailed information about f and the operations and values used to calculate f'.
The "hack" is the proper way to do it. That's simply how floats work; if you want saner decimal behavior, NSDecimalNumber (or the NSDecimal struct) might be what you want.
NSLog(@"%i", [[NSNumber numberWithFloat:((1.2 - 1) * 10)] intValue]); //2
NSLog(@"%i", [[NSNumber numberWithFloat:(((1.2f - 1) * 10))] intValue]); //2
NSLog(@"%i", [[NSNumber numberWithFloat:1.8] intValue]); //1
NSLog(@"%i", [[NSNumber numberWithFloat:1.8f] intValue]); //1
NSLog(@"%i", [[NSNumber numberWithDouble:2.0000000000001 ] intValue]);//2