I'm aware of a number of previously asked questions, but none of the solutions given work on the reproducible example that I provide below.
I am trying to read in .xls files from http://www.eia.gov/coal/data.cfm#production, specifically the coalpublic2012.xls file listed under "Historical detailed coal production data (1983-2013)", which is freely available via the dropdown. Pandas cannot read it.
In contrast, the file for the most recent year available, coalpublic2013.xls, works without a problem:
import pandas as pd
df1 = pd.read_excel("coalpublic2013.xls")
but the .xls files for the preceding years (2004-2012) do not load. I have opened these files in Excel; they open fine and are not corrupted.
The error that I get from pandas is:
---------------------------------------------------------------------------
XLRDError Traceback (most recent call last)
<ipython-input-28-0da33766e9d2> in <module>()
----> 1 df = pd.read_excel("coalpublic2012.xlsx")
/Users/jonathan/anaconda/lib/python2.7/site-packages/pandas/io/excel.pyc in read_excel(io, sheetname, header, skiprows, skip_footer, index_col, parse_cols, parse_dates, date_parser, na_values, thousands, convert_float, has_index_names, converters, engine, **kwds)
161
162 if not isinstance(io, ExcelFile):
--> 163 io = ExcelFile(io, engine=engine)
164
165 return io._parse_excel(
/Users/jonathan/anaconda/lib/python2.7/site-packages/pandas/io/excel.pyc in __init__(self, io, **kwds)
204 self.book = xlrd.open_workbook(file_contents=data)
205 else:
--> 206 self.book = xlrd.open_workbook(io)
207 elif engine == 'xlrd' and isinstance(io, xlrd.Book):
208 self.book = io
/Users/jonathan/anaconda/lib/python2.7/site-packages/xlrd/__init__.pyc in open_workbook(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows)
433 formatting_info=formatting_info,
434 on_demand=on_demand,
--> 435 ragged_rows=ragged_rows,
436 )
437 return bk
/Users/jonathan/anaconda/lib/python2.7/site-packages/xlrd/book.pyc in open_workbook_xls(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows)
89 t1 = time.clock()
90 bk.load_time_stage_1 = t1 - t0
---> 91 biff_version = bk.getbof(XL_WORKBOOK_GLOBALS)
92 if not biff_version:
93 raise XLRDError("Can't determine file's BIFF version")
/Users/jonathan/anaconda/lib/python2.7/site-packages/xlrd/book.pyc in getbof(self, rqd_stream)
1228 bof_error('Expected BOF record; met end of file')
1229 if opcode not in bofcodes:
-> 1230 bof_error('Expected BOF record; found %r' % self.mem[savpos:savpos+8])
1231 length = self.get2bytes()
1232 if length == MY_EOF:
/Users/jonathan/anaconda/lib/python2.7/site-packages/xlrd/book.pyc in bof_error(msg)
1222 if DEBUG: print("reqd: 0x%04x" % rqd_stream, file=self.logfile)
1223 def bof_error(msg):
-> 1224 raise XLRDError('Unsupported format, or corrupt file: ' + msg)
1225 savpos = self._position
1226 opcode = self.get2bytes()
XLRDError: Unsupported format, or corrupt file: Expected BOF record; found '<?xml ve'
And I have tried various other things:
df = pd.ExcelFile("coalpublic2012.xls", encoding_override='cp1252')
import xlrd
wb = xlrd.open_workbook("coalpublic2012.xls")
to no avail. My pandas version is 0.17.0.
I've also submitted this as a bug on the pandas GitHub issue tracker.
You can convert this Excel XML file programmatically. Requirements: Windows with Office installed.
1. Create the script ExcelToCsv.vbs in Notepad:
If WScript.Arguments.Count < 3 Then
    WScript.Echo "Please specify the source and the destination files. Usage: ExcelToCsv <xls/xlsx source file> <csv destination file> <worksheet number (starts at 1)>"
    WScript.Quit
End If
csv_format = 6
Set objFSO = CreateObject("Scripting.FileSystemObject")
src_file = objFSO.GetAbsolutePathName(Wscript.Arguments.Item(0))
dest_file = objFSO.GetAbsolutePathName(WScript.Arguments.Item(1))
worksheet_number = CInt(WScript.Arguments.Item(2))
Dim oExcel
Set oExcel = CreateObject("Excel.Application")
Dim oBook
Set oBook = oExcel.Workbooks.Open(src_file)
oBook.Worksheets(worksheet_number).Activate
oBook.SaveAs dest_file, csv_format
oBook.Close False
oExcel.Quit
2. Run the script from the command line:
$ cscript ExcelToCsv.vbs coalpublic2012.xls coalpublic2012.csv 1
3. Read the resulting CSV with pandas:
>>> df1 = pd.read_csv('coalpublic2012.csv', skiprows=3)
Reference: Faster way to read Excel files to pandas dataframe
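If Excel is not available, the conversion might also be attempted directly in Python. The following is only a minimal sketch, and it rests on an assumption: that the file is in the "XML Spreadsheet 2003" (SpreadsheetML) layout, with Workbook, Worksheet, Table, Row, Cell and Data elements in the urn:schemas-microsoft-com:office:spreadsheet namespace. The function name read_spreadsheetml is hypothetical. It ignores merged cells and skipped rows, and returns every value as a string.

```python
import xml.etree.ElementTree as ET

# Namespace used by Excel 2003 XML (SpreadsheetML) documents.
# Assumption: the file in question follows this layout.
SS = "urn:schemas-microsoft-com:office:spreadsheet"


def read_spreadsheetml(source, sheet_index=0):
    """Return the rows of one worksheet as lists of cell strings.

    `source` is a file name or an open file object.
    """
    root = ET.parse(source).getroot()
    worksheets = root.findall("{%s}Worksheet" % SS)
    table = worksheets[sheet_index].find("{%s}Table" % SS)

    rows = []
    for row in table.findall("{%s}Row" % SS):
        values = []
        for cell in row.findall("{%s}Cell" % SS):
            # An ss:Index attribute (1-based) means empty cells were skipped;
            # pad with None so the columns stay aligned.
            index = cell.get("{%s}Index" % SS)
            if index is not None:
                values.extend([None] * (int(index) - 1 - len(values)))
            data = cell.find("{%s}Data" % SS)
            values.append(data.text if data is not None else None)
        rows.append(values)
    return rows
```

The returned list of rows can be passed to pd.DataFrame. Which row holds the column names depends on the file; the skiprows=3 used above suggests the header is the fourth row, but that would need checking against the actual data.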
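The problem is that while the 2013 file is a genuine Excel file, the 2012 file is actually an XML document saved with an .xls extension (note the '<?xml ve' at the end of the error message). That format does not seem to be supported by xlrd, the library pandas uses here to read .xls files. I would say your best bet is to open it in Excel and save a copy as either a proper Excel file or a CSV.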
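To check which kind of file you have before handing it to pandas, you can look at its first few bytes. This is a rough sketch rather than a complete detector: the function name sniff_spreadsheet_format and the returned labels are made up for illustration, and it only distinguishes the binary .xls container, zip-based .xlsx, and XML text.

```python
# Leading bytes ("magic numbers") of the common spreadsheet containers.
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # binary .xls (OLE2 compound file)
ZIP_MAGIC = b"PK\x03\x04"                          # .xlsx is a zip archive
UTF8_BOM = b"\xef\xbb\xbf"


def sniff_spreadsheet_format(path):
    """Guess the real format of a file from its first bytes."""
    with open(path, "rb") as f:
        head = f.read(16)
    if head.startswith(OLE2_MAGIC):
        return "xls"
    if head.startswith(ZIP_MAGIC):
        return "xlsx"
    # Skip an optional UTF-8 byte order mark before looking for the XML declaration.
    if head.startswith(UTF8_BOM):
        head = head[len(UTF8_BOM):]
    if head.startswith(b"<?xml"):
        return "xml"
    return "unknown"
```

Judging by the error message in the question, coalpublic2012.xls would be expected to come back as "xml" and coalpublic2013.xls as "xls".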