Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Numpy.float64 changes when writing to Excel (.xlsx)

I've noticed that when certain Numpy float64 values are saved as Excel file (via a Pandas DataFrame), they get changed. First I thought this has to do with some imprecision in Excel, but Excel seems to encode floating point numbers as double precision, so I am a bit confused about this observation.

>>> import numpy as np
>>> import pandas as pd

# Create a floating point number that exhibits the problem.
>>> ba = bytearray(['\x53', '\x2a', '\xb0', '\x49', '\xf3', '\x79', '\x90', '\x40'])
>>> ba
bytearray(b'S*\xb0I\xf3y\x90@')
>>> f = np.frombuffer(ba)
>>> f[0]
1054.4875857854684

# Write to dataframe to save as Excel file.
>>> df = pd.DataFrame({'a': f})
>>> df.to_excel('test.xlsx', engine='xlsxwriter')

# Read excel file (when viewing the file in LibreOffice, the 
# value isn't 1054.4875857854684 any more).
>>> df2 = pd.read_excel('test.xlsx')
>>> df2.ix[0,'a']
1054.4875857854699
>>> df2.ix[0,'a'] == f[0]
False

Why is it not possible to read the same float64 back from Excel that was previously written?

I also tried this with Openpyxl (.xlsx format) and Xlwt (.xls format) as engines. While the former produced the same erroneous result as xlsxwriter, Xlwt was actually working as expected and wrote the float according to the exact variable value. Is there perhaps a parameter that I miss for the .xlsx format writer engines?

# this uses the xlwt engine
>>> df.to_excel('test.xls')
>>> df2 = pd.read_excel('test.xls')
>>> df2.ix[0,'a'] == f[0]
True
like image 394
orange Avatar asked Nov 07 '22 20:11

orange


1 Answers

I also tried this with Openpyxl (.xlsx format) and Xlwt (.xls format) as engines. While the former produced the same erroneous result as xlsxwriter, Xlwt was actually working as expected and wrote the float according to the exact variable value.

The difference is that .xls is a binary file format and the 64 bit representation of the the IEEE 754 double is written exactly to the file and can be read back to the same 64 bits.

The .xlsx file format however is a collection of text XML files in a zip container. As such doubles are written as a string representation of the double (using a format like '%.16g') and read in by converting that string representation back to a double. That is essentially a lossey process for doubles since there isn't an exact string representation for a vast majority of IEEE 754 numbers.

For example if you take the numpy number in your example and format it with different precisions you will get different representations:

>>> '%.16g' % f[0]
'1054.487585785468'

>>> '%.17g' % f[0]
'1054.4875857854684'

>>> '%.18g' % f[0]
'1054.48758578546835'

You can also demonstrate this yourself by pasting 1054.4875857854684 into a cell in Excel, saving the file and examining the output:

So for a file like this:

enter image description here

You would get something like this:

$ unzip numpy.xlsx -d numpy

$ xmllint --format numpy/xl/worksheets/sheet1.xml | grep 1054
        <v>1054.4875857854599</v>

This is more or less what you are seeing when you read the file back in using Pandas.

like image 188
jmcnamara Avatar answered Nov 15 '22 04:11

jmcnamara