Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

floating numeric value read as timedate using openpyxl

I have an Excel spreadsheet that has a field containing small %f.2 values, such as 1.2, 1.07, 2.3, etc and for some reason openpyxl is reading these cells as a 1900 date. I have seen a variety of times this issue has been raised, but usually these users are expecting a date and are getting a bogus date. I am expecting a value, usually x<10.0 and i am getting about 30-40% 'bad' data (read as timedate), while the other time this is read as a numeric value.

I am using iterators, so I do a simple ws.iter_rows() call to pull the data one row at a time. I have tried to 'cast' this to a previously created variable containing a numeric value, but this doesn't do much good.

Does anyone have a suggestion on how to overcome this sporadic issue. If this is a known bug, are there any known workarounds?

I find that if I store the file as a csv, and re-open it as csv, then re-store it as xlsx I will end up with a file that I can read correctly. While this helps debug the code, I need a solution that my customer can use without jumping through these hoops.

I would think that if the column was not formatted correctly it would apply to all elements, so having this occur intermittently is confusing.

import openpyxl
from openpyxl import load_workbook

# Source workbook - wb

wb = load_workbook(filename = r'C:\data\TEST.xlsx' , use_iterators = True)
ws = wb.get_sheet_by_name(name ='QuoteFile ')

for row in ws.iter_rows():
        print(row[0].internal_value ,row[3].internal_value ,row[4].internal_value         ,row[5].internal_value)


print('Done')

Here is my input as seen from an excel table

20015   2.13    1.2 08/01/11
20015   5.03    1.2 08/01/11
20015   5.03    1.2 08/01/11
20015   5.51    1.2 08/01/11
20015   8.13    1.2 08/01/11
20015   5.60    1.2 08/01/11
20015   5.03    1.2 08/01/11
20015   1.50    1.2 08/01/11
20015   1.50    1.2 08/01/11
20015   1.50    1.2 08/01/11
20015   1.50    1.2 08/01/11
20015   1.50    1.2 08/01/11
20015   1.50    1.2 08/01/11

Here is my output, you can see the first seven rows indicate the second field as a date from 1900, while rows 8-13 show the field correctly as a numeric field:

20015.0 1900-01-02 03:07:12 1.2 2011-08-01 00:00:00
20015.0 1900-01-05 00:43:12 1.2 2011-08-01 00:00:00
20015.0 1900-01-05 00:43:12 1.2 2011-08-01 00:00:00
20015.0 1900-01-05 12:14:24 1.2 2011-08-01 00:00:00
20015.0 1900-01-08 03:07:12 1.2 2011-08-01 00:00:00
20015.0 1900-01-05 14:24:00 1.2 2011-08-01 00:00:00
20015.0 1900-01-05 00:43:12 1.2 2011-08-01 00:00:00
20015.0 1.5 1.2 2011-08-01 00:00:00
20015.0 1.5 1.2 2011-08-01 00:00:00
20015.0 1.5 1.2 2011-08-01 00:00:00
20015.0 1.5 1.2 2011-08-01 00:00:00
20015.0 1.5 1.2 2011-08-01 00:00:00
20015.0 1.5 1.2 2011-08-01 00:00:00

using python 3.3 and openpyxl 1.6.2

like image 860
Joel Avatar asked Nov 03 '22 20:11

Joel


1 Answers

Disclaimer: I have no idea how to work openpyxl. However, you mostly only need to worry about the datetime module.

If you know which rows are supposed to be numbers, you could try code like this to convert an Excel date-format to a float, and ignore it if it's a number:

import datetime
import openpyxl
from openpyxl import load_workbook

# Source workbook - wb

wb = load_workbook(filename = r'C:\data\TEST.xlsx' , use_iterators=True)
ws = wb.get_sheet_by_name(name='QuoteFile ')

If val's a number, return it. Otherwise, take the difference between the datetime
and 1899-12-31 00:00:00. The way the datetimes work is they're internally a float,
being the number of days since the start of 1900. We get the number of seconds in
the delta (done through subtraction) and divide that by 86400 (the number of seconds
in a day).
def forcefloat(val):
    """If val's a number, return it. Otherwise, take the difference between the
    datetime and 1899-12-31 00:00:00. The way the datetimes work is they're
    internally a float, being the number of days since the start of 1900.
    We get the number of seconds in the delta (done through subtraction)
    and divide that by 86400 (the number of seconds in a day)."""
    if isinstance(val, (int, float)):
        return val
    assert isinstance(val, datetime.datetime)
    return (val - datetime.datetime(1899,12,31,0,0,0)).total_seconds() / 86400

for row in ws.iter_rows():
        print(
            row[0].internal_value,
            forcefloat(row[3].internal_value),
            row[4].internal_value,
            row[5].internal_value,
        )

print('Done')

Not exactly the most elegant solution, but it works.

like image 86
Alyssa Haroldsen Avatar answered Nov 15 '22 05:11

Alyssa Haroldsen