In my project I have to compute division, multiplication, subtraction, and addition on a matrix of double elements.
The problem is that as the size of the matrix increases, the accuracy of my output degrades drastically.
Currently I am using double for each element, which I believe occupies 8 bytes of memory and has an accuracy of about 15–16 significant digits, irrespective of the position of the decimal point.
Even for a large matrix, the memory occupied by all the elements is only in the range of a few kilobytes, so I can afford to use data types that require more memory.
So I wanted to know which data type is more precise than double.
I tried searching in some books and found long double, but I don't know what its precision is.
And what if I want more precision than that?
In C and related programming languages, long double is a floating-point data type that is often more precise than double, though the language standard only requires it to be at least as precise as double. For comparison, float typically gives you about 6–7 significant decimal digits and double about 15–16; the exact, implementation-defined values are exposed as the FLT_DIG, DBL_DIG, and LDBL_DIG macros in <float.h>. If you need better accuracy than float, use double; if you need better accuracy than double, long double is the next standard step up.
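A quick way to see what your particular platform provides is to print the sizes and *_DIG values of the three standard floating-point types. This minimal sketch uses only standard <float.h> macros; the sizes and digit counts it prints will vary by compiler and target:

    #include <stdio.h>
    #include <float.h>

    int main(void)
    {
        /* *_DIG is the number of decimal digits the type can hold
         * without loss; the values are implementation-defined.    */
        printf("float:       %2zu bytes, %2d digits\n", sizeof(float),       FLT_DIG);
        printf("double:      %2zu bytes, %2d digits\n", sizeof(double),      DBL_DIG);
        printf("long double: %2zu bytes, %2d digits\n", sizeof(long double), LDBL_DIG);
        return 0;
    }

On x86-64 Linux with GCC this typically reports 4 bytes/6 digits for float, 8 bytes/15 digits for double, and 16 bytes/18 digits for long double (the 80-bit extended format padded to 16 bytes).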
According to Wikipedia, the 80-bit "Intel" IEEE 754 extended-precision long double, which is typically padded to 12 or 16 bytes in memory, has a 64-bit mantissa with no implicit bit, which gets you about 19.26 decimal digits. This has been the almost universal standard for long double on x86 for ages, but recently things have started to change.
The newer 128-bit quadruple-precision format has 112 mantissa bits plus an implicit bit, which gets you about 34 decimal digits. GCC implements this as the __float128 type, and (if memory serves) there is a compiler option to make long double use it.
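If you want to try that extra precision, here is a minimal sketch of using __float128 directly. It assumes GCC with libquadmath available (link with -lquadmath), since both the type and the Q literal suffix are GCC extensions rather than standard C:

    /* Compile with: gcc quad.c -lquadmath */
    #include <stdio.h>
    #include <quadmath.h>

    int main(void)
    {
        __float128 third = 1.0Q / 3.0Q;   /* Q suffix: __float128 literal (GCC extension) */
        char buf[128];

        /* printf cannot format __float128 directly; libquadmath provides
         * quadmath_snprintf with the Q length modifier for that purpose. */
        quadmath_snprintf(buf, sizeof buf, "%.34Qg", third);
        printf("1/3 as __float128: %s\n", buf);
        return 0;
    }

Beyond 34 digits, arbitrary-precision libraries such as GMP/MPFR are the usual answer, at a significant cost in speed.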
You might want to consider the sequence of operations, i.e. do the additions in an ordered sequence starting with the smallest values first. This increases the overall accuracy of the result while using the same mantissa precision:
1e00 + 1e-16 + ... + 1e-16 (1e16 times) = 1e00
1e-16 + ... + 1e-16 (1e16 times) + 1e00 = 2e00
The point is that adding small numbers to a large number makes them disappear: each 1e-16 falls below the rounding granularity (ulp) of 1e00 and is absorbed. So the latter approach, summing the small numbers first so they can accumulate before meeting the large one, reduces the numerical error.
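Here is a small, runnable demonstration of the effect. It is a scaled-down sketch of the example above (10 additions instead of 1e16, so it finishes instantly), but the mechanism is the same:

    #include <stdio.h>

    int main(void)
    {
        /* Large value first: each 1e-16 is below half the ulp of 1.0
         * (about 1.1e-16), so every addition rounds back to 1.0.     */
        double big_first = 1.0;
        for (int i = 0; i < 10; i++)
            big_first += 1e-16;

        /* Small values first: they accumulate to 1e-15, which is large
         * enough to survive the final addition of 1.0.                 */
        double small_first = 0.0;
        for (int i = 0; i < 10; i++)
            small_first += 1e-16;
        small_first += 1.0;

        printf("big value first:    %.17g\n", big_first);    /* 1 */
        printf("small values first: %.17g\n", small_first);  /* ~1.000000000000001 */
        return 0;
    }

Sorting the operands before summing (or using Kahan compensated summation) applies the same idea to general data.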