Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Failing to open an Excel file with Python

I'm on a Debian GNU/Linux computer, working with Python 2.7.9.

As a part of my job, I have been making python scripts that read inputs in various formats (e.g. Excel, Csv, Txt) and parse the information to more standarized files. It's not my first time opening or working with Excel files.

There's a particular file which is giving me problems, I just can't open it. When I tried with xlrd (version 0.9.3), it gave me the following error:

xlrd.open_workbook('sample.xls')

XLRDError: Unsupported format, or corrupt file: BOF not workbook/worksheet: op=0x0009 vers=0x0002 strm=0x000a build=0 year=0 -> BIFF21

I tried to investigate the matter on my own, found a couple of answers in StackOverflow but I couldn't open it anyway. This particular answer I found may be the problem (the second explanation), but it doesn't include a workaround: https://stackoverflow.com/a/16518707/4345659

A tool that could conert the file to csv/txt would also solve the problem.

I already tried with:

  • xlrd
  • openpyxl
  • xlsx2csv (the shell tool)

A sample file is available here: https://ufile.io/r4m6j

As a side note, I can open it with LibreOffice Calc and MS Excel, so I could eventually change it to csv that way. The thing is, I need to do it all with a python script.

Thanks in advance!

like image 400
francisco sollima Avatar asked Nov 07 '22 15:11

francisco sollima


1 Answers

It seems like MS Problem. The xls file is very strange, maybe you should contact xlrd support.

But I have a crazy workaround for you: xls2ods. It works for me even though xls2csv doesn't (SiC!).

So, install catdoc first:

$sudo apt-get install catdoc

Then convert your xls file to ods and open ods using pyexcel_ods or whatever you prefer. To use pyexcel_ods install it first using pip install pyexcel_ods.

import subprocess

from pyexcel_ods import get_data

file_basename = 'sample'

returncode = subprocess.call(['xls2ods', '{}.xls'.format(file_basename)])
if returnecode > 0:
    # consider to use subprocess.Popen if you need more control on stderr
    exit(returncode)

data = get_data('{}.ods'.format(file_basename))
print(data)

I'm getting following output:

OrderedDict([(u'sample',
              [[u'labo',
                u'codfarm',
                u'farmacia',
                u'direccion',
                u'localidad',
                u'nom_medico',
                u'matricula',
                u'troquel',
                u'producto',
                u'cant_total']])])
like image 84
Shmygol Avatar answered Nov 14 '22 22:11

Shmygol