Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

xlrd crashes when reading .xls file modified by PHPExcel

I'm having an extremely difficult time editing some Excel files using PHP and python.

I originally did everything in PHP using PHPExcel, but I was processing very large files and PHPExcel crashed when it ran out of memory. So I changed it to do some work using PHP and do the rest using python.

So the process is:

  • Parse xml posted to PHP script
  • Insert rows into Excel (.xls) file based on xml data
  • Pass (.xls) file and xml data to python script to populate the spreadsheet
  • ex. python upload.py Example.xls data.xml called by PHP
  • python script uses xlrd, xlwt and xlutils to populate Excel file

The problem I'm having is that if the python script modifies a regular .xls file that I created by hand, it works perfectly. But once PHP excel modifies the Excel file, the python script produces the following error:

_locate_stream(Workbook): seen
  0  5 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 
 20  4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 
100= 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 
120  4 4 4 4 4 4 4 4 4 4 4 4 4 3 2 2 
File "upload.py", line 63, in <module>
workbook_readonly = open_workbook(excel,formatting_info=True,on_demand=True)
File "/home/student/eamorde/public_html/dining/xlrd/__init__.py", line 435, in open_workbook
ragged_rows=ragged_rows,
File "/home/student/eamorde/public_html/dining/xlrd/book.py", line 87, in open_workbook_xls
ragged_rows=ragged_rows,
File "/home/student/eamorde/public_html/dining/xlrd/book.py", line 619, in biff2_8_load
cd.locate_named_stream(UNICODE_LITERAL(qname))
File "/home/student/eamorde/public_html/dining/xlrd/compdoc.py", line 390, in locate_named_stream
d.tot_size, qname, d.DID+6)
File "/home/student/eamorde/public_html/dining/xlrd/compdoc.py", line 418, in _locate_stream
raise CompDocError("%s corruption: seen[%d] == %d" % (qname, s, self.seen[s]))
xlrd.compdoc.CompDocError: Workbook corruption: seen[2] == 4

So I dug through the source code of xlrd and found the line that is producing the error:

def _locate_stream(self, mem, base, sat, sec_size, start_sid, expected_stream_size, qname, seen_id):
    # print >> self.logfile, "_locate_stream", base, sec_size, start_sid, expected_stream_size
    s = start_sid
    if s < 0:
        raise CompDocError("_locate_stream: start_sid (%d) is -ve" % start_sid)
    p = -99 # dummy previous SID
    start_pos = -9999
    end_pos = -8888
    slices = []
    tot_found = 0
    found_limit = (expected_stream_size + sec_size - 1) // sec_size
    while s >= 0:
        if self.seen[s]:
            print("_locate_stream(%s): seen" % qname, file=self.logfile); dump_list(self.seen, 20, self.logfile)
            raise CompDocError("%s corruption: seen[%d] == %d" % (qname, s, self.seen[s]))

The last line is the one raising the exception:

raise CompDocError("%s corruption: seen[%d] == %d" % (qname, s, self.seen[s]))

Can anyone explain this? The file is not corrupted in that opening it in Excel works fine, but xlrd seems to be unable to read it.

My PHP script does the following (rough sketch):

$phpExcel = new PHPExcel();
$file = "MyFile.xls";
$reader = new PHPExcel_Reader_Excel5();
$phpExcel = $reader->load($file);
//(... insert rows based on xml)
$writer = new PHPExcel_Writer_Excel5();
$writer->save('MyFile.xls');
exec("python upload.py MyFile.xls data.xml");

If anyone knows why this might be happening or even a better solution to my problems (PHPExcel memory issues) it would be greatly appreciated.

Edit: The source code for the file that's raising the error can be found here.

Edit: I created an example, basically took my excel file and removed any identifying information. To try it yourself, see the gist here.

like image 585
Eric Amorde Avatar asked May 17 '13 22:05

Eric Amorde


1 Answers

I got same error with one of my .xls files (excel can open them just fine, but xlrd fails). As I guess Compdoc.seen array keeps track of already read "FAT" sectors. In my case Root Entry reading block (SSCS) gets all that sectors marked as seen, leading to exception raise in future. U can try to find the bug in sectors reading logic and contribute to xlrd :) or just comment this lines with exception raise which will likely solve problem in your case (As did in mine) and wait for xlrd update.

like image 188
glmvrml Avatar answered Nov 14 '22 22:11

glmvrml