Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Apache POI not returning the proper value for large numbers coming from Excel

Tags:

apache-poi

I have an excel file with the value 6228480018362050000 the exported csv looks like this...

Int,Bigint,String
1,6228480018362050000,Very big

When I try running the following code...

InputStream inp = new FileInputStream("/.../test.xlsx");
DataFormatter df = new DataFormatter(true);
df.formatCellValue(WorkbookFactory.create(inp).getSheetAt(0).getRow(1).getCell(1));

I get 6228480018362049500 which is the wrong number because precision is hosed. Is there a way to get the actual value?

like image 820
Jackie Avatar asked Dec 04 '22 23:12

Jackie


2 Answers

There is no change (loss or gain) of precision between 6228480018362050000 and 6228480018362049500. They are simply two different decimal presentations of the same internal binary value, which in decimal is exactly 6228480018362049536, by the way.

Regardless of the cell format, Excel displays (not "stores") only up to the first 15 significant digits, rounding any digits to the right [1].

However, other applications and file formats show up to the first 17 significant digits (or more), which is really what the IEEE 754 standard requires in order to represent every binary value [2]. Apparently, that is true of Apache POI and OpenXML.

You can demonstrate this by doing the following.

  1. In Excel, enter 6228480018362050000. Save as XML.

  2. Open the XML file in Notepad. Note that the Cell/Data element shows 6.2284800183620495E+18, which is 6228480018362049500.

  3. Open the XML file in Excel. Note that Excel still displays 6228480018362050000 in the Formula Bar and in the cell formatted as Number.

It is true that Excel truncates manually-entered numbers (including those read from CSV and TXT files) to the first 15 significant digits, replacing any digits to the right with zeros. But Excel VBA does not.

So for another demonstration, enter the following in VBA, then execute the procedure.

Sub doit()
Range("a1:a2").NumberFormat = "0"
Range("a1") = CDbl("6228480018362050000")
Range("a2") = CDbl("6228480018362049536")
Columns("a").AutoFit
Range("b2") = "=match(a1,a2,0)"
End Sub

Note that A1 and A2 display 6228480018362050000. B2 displays 1, indicating that the internal binary values are an exact match, and VBA does not truncate after the first 15 significant digits.


Explanation....

Excel and most applications use IEEE 754 double-precision to represent numeric values. The binary representation is the sum of 53 consecutive powers of 2 ("bits") times an exponential factor.

Consequently, only integers up to 9007199254740992 (2^53) can be represented exactly. (But note that Excel displays 9007199254740990 for =2^53 because of its 15-significant-digit formatting limitation.)

Most larger integers can only be approximated.

And that is true of most decimal fractions as well, regardless of the number of significant digits. That is part of the reason why =10.1-10 displays 0.0999999999999996 in the Formula Bar and in the cell formatted with 16 decimal places (15 significant digits).


But beware: a calculated value that displays as 6228480018362050000 might differ from the actual internal binary value.

For example, if you enter 6228480018362050000 into A1 and the formula =6228480018362050000+1600 into A2, both A1 and A2 display 6228480018362050000.

But =MATCH(A1,A2,0) returns #N/A, which indicates that the internal binary values are not an exact match.

And the XML file would show 6.2284800183620516E+18 in the Data element corresponding to the Cell element for A2, which is 6228480018362051600. The actual internal binary value, in decimal, is exactly 6228480018362051584.

(FYI, the Excel equal operator ("=") does not compare the internal binary values. Instead, it compares the values rounded to 15 significant digits. So =(A1=A2) returns TRUE misleadingly. It is intended to be a feature; but it is implemented inconsistently.)

If you copy A2 and paste-value into A3, =MATCH(A1,A3,0) continues to return #N/A. But if you subsequently "edit" A3 (e.g. press f2, then Enter), =MATCH(A1,A3,0) returns 1. The internal value of A3 has been changed to the binary representation of 6228480018362050000.

I wonder if that is actually the mysterious problem that you encountered, and you inadvertently oversimplified it with your example.

Does that help?


[1] Cell format does not affect the internal binary value with two exceptions: (1) when Precision As Displayed is set, which is almost never recommended; and (2) when the cell value is calculated, and the worksheet is saved in CSV or TXT file, then re-open or imported in Excel.

[2] Although IEEE 754 specifies that 17 significant decimal digits are the minimum needed to represent all binary values, that does not mean that only 17 significant decimal digits are "stored". As demonstrated above, 6228480018362049500 is actually stored as exactly 6228480018362049536.

like image 30
joeu2004.5 Avatar answered Dec 11 '22 17:12

joeu2004.5


If we put long numbers into Excel cells, then those numbers will be truncated to 15 significant digits. This is because Excel does not know such things like big integers. It has only floating point to store numeric values. And with those it follows the IEEE 754 specification. But some numbers cannot be stored as floating point numbers according to the IEEE 754 specification. With your example the 6228480018362050000, which is 6.22848001836205E+018, cannot be stored as such. It will be 6.2284800183620495E+018 or 6228480018362049500 according to IEEE 754 specification.

Microsoft's knowledge base mentions: "Excel follows the IEEE 754 specification on how to store and calculate floating-point numbers. Excel therefore stores only 15 significant digits in a number, and changes digits after the fifteenth place to zeroes."

This is not the whole truth. In reality at least with Office OpenXML (*.xlsx) it stores the values according to IEEE 754 specification and not only 15 significant digits. With your example it stores <v>6.2284800183620495E+18</v>. But thats secondary. Because even if it would store 6.22848001836205E+018, somewhere this must be reconverted to floating point and then it will be 6.2284800183620495E+18 again. Excel does the same while opening the workbook. It converts <v>6.2284800183620495E+18</v> to floating point and then it only displays 15 significant digits.

So if you really need to store the 6228480018362050000 as a number in Excel, then the only way to get the same results as in Excel is to do the same as Excel. To do so we can use BigDecimal and it's round method which is able to use a MathContext with setted precision.

Example:

import org.apache.poi.ss.usermodel.*;

import java.io.*;

import java.math.BigDecimal;
import java.math.MathContext;

class ReadExcelBigNumbers {

 public static void main(String[] args) throws Exception{

  for (int i = 0; i < 10; i++) {
   String v = "6.2284800183620" + i + "E+018";
   double d = Double.parseDouble(v);
   System.out.print(v + "\t");
   System.out.print(d + "\t");
   BigDecimal bd = new BigDecimal(d);
   v = bd.round(new MathContext(15)).toPlainString();
   System.out.println(v);
  }

  InputStream inp = new FileInputStream("test.xlsx");
  Workbook wb = WorkbookFactory.create(inp);
  for (int i = 1; i < 9; i++) {
   double d = wb.getSheetAt(0).getRow(i).getCell(1).getNumericCellValue();
   BigDecimal bd = new BigDecimal(d);
   String v = bd.round(new MathContext(15)).toPlainString();
   System.out.println(v);
  }
 }
}

The first part prints:

6.22848001836200E+018   6.2284800183620004E18   6228480018362000000
6.22848001836201E+018   6.2284800183620096E18   6228480018362010000
6.22848001836202E+018   6.2284800183620198E18   6228480018362020000
6.22848001836203E+018   6.2284800183620301E18   6228480018362030000
6.22848001836204E+018   6.2284800183620403E18   6228480018362040000
6.22848001836205E+018   6.2284800183620495E18   6228480018362050000
6.22848001836206E+018   6.2284800183620598E18   6228480018362060000
6.22848001836207E+018   6.22848001836207E18     6228480018362070000
6.22848001836208E+018   6.2284800183620803E18   6228480018362080000
6.22848001836209E+018   6.2284800183620905E18   6228480018362090000

There you can see the difference between wanted floating point value, real floating point value according IEEE 754 specification and reformatted BigDecimal. As you see only the 6.22848001836207E+018 can be stored according to the IEEE 754 specification directly.

The second part does the same using the following Excel sheet:

enter image description here

Another possible workaround is mentioned in the knowledge base article : "To work around this behavior, format the cell as text, then type the numbers. The cell can then display up to 1,024 characters. ". This is good if the numbers are not really numbers but Identifiers for example or some other strings where the digits are only meant as characters. Calculations with such "Text-Numbers" are of course not possible without reconverting them to floating point which will bring the problem again.

like image 96
Axel Richter Avatar answered Dec 11 '22 15:12

Axel Richter