Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to load in Python an xlsx that originally had .xls file extension?

I'm using xlrd to process .xls files, and openpyxl to process .xlsx files, and this is working well.

Then I'm handed what is ostensibly a .xls file, so I try to xlrd.open_workbook(), and get:

XLRDError: Unsupported format, or corrupt file: Expected BOF record; found '<?xml ve'

I take a look at this question, and I surmise that my file, although ending with extension .xls, must actually be a .xlsx. And indeed, I can view it in a text editor:

<?xml version="1.0" encoding="UTF-8"?>
                      <Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
                       xmlns:x="urn:schemas-microsoft-com:office:excel"
                       xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"
                       xmlns:html="http://www.w3.org/TR/REC-html40">
:
:
:

(for privacy reasons, I can't post the whole file, but it's probably not required for our analysis).

So I surmise that if I just copy (cp) it to a .xlsx, I should be able to open it with openpyxl.load_workbook(), but I get:

BadZipfile: File is not a zip file

If it's actually an xls (unlikely) but can't be opened with xlrd, and if it is atcually an xlsx but can't be opened with openpyxl, even after I cp it to a .xlsx, what to do?

Note: If I open up the .xls in Excel, save it as a .xlsx, and retry with openpyxl, it does load fine, but this manual step is not a luxury I will have in the executing of my program.

like image 719
Pyderman Avatar asked Jul 03 '15 17:07

Pyderman


2 Answers

One thing is clear: The file you're trying to open has a different format than its extension suggests.

As you already know, Excel file formats include (but are not limited to) xls and xlsx.

  • The Excel 2003 format (xls) is a binary format. This means that if you open a xls file with a text editor, you'll just see gibberish.

  • The Excel 2007 format (xlsx) is quite different. A xlsx file is a zip file with a bunch of XML files inside. You can use a zip archiver to extract the contents of the xlsx file. Then, you can edit the XML files using any text editor. However, opening a xlsx file directly with a text editor is like opening a zip file with a text editor: You'll just see gibberish.

The fact that you can open your file with a text editor (and read its contents) shows that it's neither a xls file nor a xlsx file. Your file is neither a binary file nor a zip file, it's a plain XML file.

Moreover, this error message says a lot.

BadZipfile: File is not a zip file

It means that openpyxl is trying to open your file as a xlsx file and therefore a zip file. But when it tries to extract its contents, it fails, because your file isn't even a zip file.

But if the file is neither a xlsx file nor a xls file, how can Microsoft Excel read it? I wondered that too. After some research, I believe your file has the XML Spreadsheet 2003 file format. This example looks very similar to the file content you posted. Since Microsoft Excel supports this format, it's no wonder that it can read your file.

Unfortunately, Python libraries such as xlrd and openpyxl only support xls and xlsx file formats, so they won't be able to read your file. I think you'll just have to manually convert it to a supported format.

like image 166
Zenadix Avatar answered Sep 29 '22 09:09

Zenadix


I am not on OSX, so this is not tested. You may be able to use the appscript package, despite it's lack of support, to open the offending file and the resave it.

from appscript import *
excel = app('Microsoft Excel')
wb = excel.open('/path/to/file.xls')
wb.save_as('/path/to/fileout.xlsx', file_format=k.XLSX_file_format)
#not sure the exact name of k.excel_file
like image 33
James Avatar answered Sep 29 '22 09:09

James