
openpyxl please do not assume text as a number when importing

There are numerous questions about how to stop Excel from interpreting text as a number, or how to output number formats with openpyxl, but I haven't seen any solutions to this problem:

I have an Excel spreadsheet given to me by someone else, so I did not create it. When I open the file with Excel, I have certain values like "5E12" (clone numbers, if anyone cares) that appear to display correctly, but there's a little green arrow next to each one warning me that "This appears to be a number stored as text". Excel then asks me if I would like to convert it to a number, and if I say yes, I get 5000000000000, which then converts automatically to scientific notation and displays 5E12 again, only this time a text output would show the full number with zeroes. Note that before the conversion, this really is text, even to Excel, and I'm only being warned about it and offered a conversion.

So, when reading this file in with openpyxl (from openpyxl.reader.excel import load_workbook), the 5E12 is getting converted automatically to 5000000000000. I assume that openpyxl is making the same assumption that Excel made, only the conversion happens without a prompt or input on my part.

How can I prevent this from happening? I do not want text that looks like "numbers stored as text" to be converted to numbers. It is text unless I say so.

So far, the only solution I have found is to add single quotes to the front of each cell, but this is not an ideal solution, as it's manual labor rather than a programmatic solution. Also, the solution needs to be general, since I don't always know where this problem might occur (I'm reading millions of lines per day, so I don't want to have to do anything by hand).

I think this is a problem with openpyxl. There is a Google Groups discussion from the beginning of 2011 that mentions this problem, but assumes it's too rare to matter. https://groups.google.com/forum/?fromgroups=#!topic/openpyxl-users/HZfpShMp8Tk

So, any suggestions?

Dannid asked Nov 30 '12 01:11


1 Answer

If you want to use openpyxl again (for whatever reason), the following changes to the worksheet reader routine do the trick of keeping the strings as strings:

diff --git a/openpyxl/reader/worksheet.py b/openpyxl/reader/worksheet.py
--- a/openpyxl/reader/worksheet.py
+++ b/openpyxl/reader/worksheet.py
@@ -134,8 +134,10 @@
             data_type = element.get('t', 'n')
             if data_type == Cell.TYPE_STRING:
                 value = string_table.get(int(value))
-
-            ws.cell(coordinate).value = value
+                ws.cell(coordinate).set_value_explicit(value=value,
+                                                data_type=Cell.TYPE_STRING)
+            else:
+                ws.cell(coordinate).value = value

         # to avoid memory exhaustion, clear the item after use
         element.clear()

Cell.value is a property; on assignment it calls Cell._set_value, which in turn calls Cell.bind_value, which according to the method's docstring does this: "Given a value, infer type and display options". Since the types of the values are already recorded in the XML file, those should be used (here I only do that for strings) instead of doing something 'smart'.

As you can see from the code, the test for whether the value is a string was already there.
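To illustrate the idea outside of openpyxl, here is a minimal standalone sketch that uses only the Python standard library. It is not openpyxl's code: SHEET_XML, SHARED_STRINGS, infer and read_cells are hypothetical names made up for this example, and the XML is a hand-written fragment in the usual worksheet layout, where a cell with t="s" holds an index into the shared string table. The sketch only shows the difference between trusting the declared type and guessing from the text.

```python
import xml.etree.ElementTree as ET

# SpreadsheetML main namespace in ElementTree's {uri}tag notation.
NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"

# Hypothetical, hand-written worksheet fragment: A1 is declared as a
# shared string (t="s"), B1 has no type attribute and is a plain number.
SHEET_XML = """<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <sheetData>
    <row r="1">
      <c r="A1" t="s"><v>0</v></c>
      <c r="B1"><v>42</v></c>
    </row>
  </sheetData>
</worksheet>"""

# Hypothetical shared string table: index 0 is the text "5E12".
SHARED_STRINGS = ["5E12"]


def infer(value):
    """Guess the type from the text alone (the 'smart' behaviour)."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return value


def read_cells(sheet_xml, shared_strings, trust_declared_type=True):
    """Return {coordinate: value} for every cell in the fragment."""
    cells = {}
    root = ET.fromstring(sheet_xml)
    for c in root.iter(NS + "c"):
        raw = c.findtext(NS + "v")
        data_type = c.get("t", "n")  # same default as in the diff above
        if data_type == "s":
            text = shared_strings[int(raw)]
            # Declared strings stay strings unless we deliberately guess.
            cells[c.get("r")] = text if trust_declared_type else infer(text)
        else:
            cells[c.get("r")] = infer(raw)
    return cells


if __name__ == "__main__":
    print(read_cells(SHEET_XML, SHARED_STRINGS))
    print(read_cells(SHEET_XML, SHARED_STRINGS, trust_declared_type=False))
```

With trust_declared_type=True the value in A1 stays the text "5E12"; with False it goes through the same kind of guessing the question describes and comes back as a float, because "5E12" happens to be valid scientific notation.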

Anthon answered Nov 15 '22 00:11