I'm having problems reading .xls files written by a Perl script which I have no control over. The files contain some formatting and line breaks within cells.
import xlrd

filename = '/home/shared/testfile.xls'
book = xlrd.open_workbook(filename)
sheet = book.sheet_by_index(0)
# Start at row 1; xrange is Python 2 (use range on Python 3)
for rowIndex in xrange(1, sheet.nrows):
    row = sheet.row(rowIndex)
This is throwing the following error:
_locate_stream(Workbook): seen
0 5 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
20 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
172480= 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
172500 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 2
172520 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
173840= 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
173860 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1
173880 1 1 1 1 1 1 1 1
Traceback (most recent call last):
  File "/home/shared/xlrdtest.py", line 5, in <module>
    book = xlrd.open_workbook(filename)
  File "/usr/local/lib/python2.7/site-packages/xlrd/__init__.py", line 443, in open_workbook
    ragged_rows=ragged_rows,
  File "/usr/local/lib/python2.7/site-packages/xlrd/book.py", line 84, in open_workbook_xls
    ragged_rows=ragged_rows,
  File "/usr/local/lib/python2.7/site-packages/xlrd/book.py", line 616, in biff2_8_load
    self.mem, self.base, self.stream_len = cd.locate_named_stream(qname)
  File "/usr/local/lib/python2.7/site-packages/xlrd/compdoc.py", line 393, in locate_named_stream
    d.tot_size, qname, d.DID+6)
  File "/usr/local/lib/python2.7/site-packages/xlrd/compdoc.py", line 421, in _locate_stream
    raise CompDocError("%s corruption: seen[%d] == %d" % (qname, s, self.seen[s]))
xlrd.compdoc.CompDocError: Workbook corruption: seen[2] == 4
I'm not able to find any info about CompDocError or Workbook corruption, even less the seen[2] == 4 part.
You have xlrd installed on your cluster and get an error when attempting to read files in the Excel .xlsx format. xlrd 2.0.0 and above can only read .xls files; support for .xlsx files was removed from xlrd due to a potential security vulnerability. Use openpyxl to open .xlsx files instead of xlrd.
Python xlrd is a very useful library when you are dealing with older Excel files (.xls). In this tutorial, I will share with you how to use this library to read data from an .xls file.
Install the openpyxl library on your cluster and confirm that you are using pandas version 1.0.1 or above.
To open a workbook with xlrd, call the open_workbook function and assign the result to a variable: workbookData = xlrd.open_workbook("myWorkbook.xls"). The variable workbookData now contains everything about that Excel workbook. With xlrd 2.0.0 and above, the file must be an .xls file.
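The split between xlrd for .xls and openpyxl for .xlsx can be sketched as a small helper that picks the pandas engine from the file extension. The names pick_engine and read_sheet are hypothetical, introduced only for illustration, and the sketch assumes pandas, xlrd and openpyxl are installed.

```python
import os


def pick_engine(path):
    """Return the pandas engine name that matches the file extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext == ".xls":
        return "xlrd"        # xlrd 2.0.0+ reads only the legacy .xls format
    if ext in (".xlsx", ".xlsm"):
        return "openpyxl"    # openpyxl handles the newer XML-based formats
    raise ValueError("unsupported extension: %r" % ext)


def read_sheet(path):
    """Read the first sheet of a workbook into a DataFrame (hypothetical helper)."""
    import pandas as pd  # imported lazily so pick_engine works without pandas
    return pd.read_excel(path, engine=pick_engine(path))
```

Calling read_sheet("report.xlsx") would then go through openpyxl, while read_sheet("report.xls") would go through xlrd.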
+1 to Ramiel.
Just comment out these lines in compdoc.py (lines 425-427 in xlrd 1.2.0):
if self.seen[s]:
    print("_locate_stream(%s): seen" % qname, file=self.logfile);dump_list(self.seen, 20, self.logfile)
    raise CompDocError("%s corruption: seen[%d] == %d" % (qname, s, self.seen[s]))
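If editing the installed library is not an option, newer xlrd releases (2.0 and later, as far as I know) accept an ignore_workbook_corruption argument to open_workbook that skips this same check. Below is a minimal sketch, not a definitive fix: open_tolerant is a hypothetical helper name, and its opener and corruption_error parameters exist only so the retry logic can be exercised without xlrd or a damaged file at hand.

```python
def open_tolerant(path, opener=None, corruption_error=None):
    """Open an .xls workbook; retry with the corruption check relaxed.

    opener and corruption_error are hypothetical injection points; by
    default they resolve to xlrd.open_workbook and xlrd's CompDocError.
    """
    if opener is None or corruption_error is None:
        import xlrd
        from xlrd.compdoc import CompDocError
        opener = opener or xlrd.open_workbook
        corruption_error = corruption_error or CompDocError
    try:
        # First attempt: normal, strict open
        return opener(path)
    except corruption_error:
        # Assumption: the installed xlrd supports ignore_workbook_corruption
        return opener(path, ignore_workbook_corruption=True)
```

With xlrd 2.x installed, book = open_tolerant('/home/shared/testfile.xls') would retry only when the strict open raises CompDocError. Data read from a file that fails this check may still be incomplete, so it is worth spot-checking the result.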
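According to pkm's comment, the problem is with the Compound File Binary container that wraps the workbook. You can extract the Workbook stream yourself with OleFileIO_PL and pass it to pandas: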
#pip install OleFileIO-PL
import OleFileIO_PL
import pandas as pd

path = 'file.xls'
with open(path,'rb') as file:
    ole = OleFileIO_PL.OleFileIO(file)
    if ole.exists('Workbook'):
        d = ole.openstream('Workbook')
        x=pd.read_excel(d,engine='xlrd')
        print(x.head())
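Before reaching for an OLE reader, it can help to confirm that the file really is a Compound File Binary container. The sketch below checks the 8-byte signature that such files start with; is_compound_file is a hypothetical helper name, and a matching signature says nothing about whether the streams inside are intact.

```python
# Signature found at the start of every Compound File Binary (OLE2) file
CFB_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"


def is_compound_file(path):
    """Return True if the file begins with the Compound File Binary signature."""
    with open(path, "rb") as fh:
        return fh.read(len(CFB_MAGIC)) == CFB_MAGIC
```

A file that fails this check is not an OLE2 container at all (for example, an .xlsx renamed to .xls starts with the ZIP signature instead), so the stream-extraction approach above would not apply to it.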