Reading an Excel file is orders of magnitude slower with openpyxl than with xlrd

Tags: python, xlrd, openpyxl

I have an Excel spreadsheet that I need to import into SQL Server on a daily basis. The spreadsheet will contain around 250,000 rows across around 50 columns. I have tested both openpyxl and xlrd with nearly identical code.

Here's the code I'm using (minus debugging statements):

    import xlrd
    import openpyxl


    def UseXlrd(file_name):
        workbook = xlrd.open_workbook(file_name, on_demand=True)
        worksheet = workbook.sheet_by_index(0)
        # The first row holds the column headers.
        first_row = []
        for col in range(worksheet.ncols):
            first_row.append(worksheet.cell_value(0, col))
        data = []
        for row in range(1, worksheet.nrows):
            record = {}
            for col in range(worksheet.ncols):
                if isinstance(worksheet.cell_value(row, col), str):
                    record[first_row[col]] = worksheet.cell_value(row, col).strip()
                else:
                    record[first_row[col]] = worksheet.cell_value(row, col)
            data.append(record)
        return data


    def UseOpenpyxl(file_name):
        wb = openpyxl.load_workbook(file_name, read_only=True)
        sheet = wb.active
        # The first row holds the column headers (openpyxl indices are 1-based).
        first_row = []
        for col in range(1, sheet.max_column + 1):
            first_row.append(sheet.cell(row=1, column=col).value)
        data = []
        for r in range(2, sheet.max_row + 1):
            record = {}
            for col in range(sheet.max_column):
                if isinstance(sheet.cell(row=r, column=col + 1).value, str):
                    record[first_row[col]] = sheet.cell(row=r, column=col + 1).value.strip()
                else:
                    record[first_row[col]] = sheet.cell(row=r, column=col + 1).value
            data.append(record)
        return data


    xlrd_results = UseXlrd('foo.xlsx')
    openpyxl_results = UseOpenpyxl('foo.xlsx')

Passing the same Excel file containing 3500 rows gives drastically different run times. Using xlrd, I can read the entire file into a list of dictionaries in under 2 seconds. Using openpyxl, I get the following results:

    Reading Excel File...
    Read 100 lines in 114.14509415626526 seconds
    Read 200 lines in 471.43183994293213 seconds
    Read 300 lines in 982.5288782119751 seconds
    Read 400 lines in 1729.3348784446716 seconds
    Read 500 lines in 2774.886833190918 seconds
    Read 600 lines in 4384.074863195419 seconds
    Read 700 lines in 6396.7723388671875 seconds
    Read 800 lines in 7998.775000572205 seconds
    Read 900 lines in 11018.460735321045 seconds

While I can use xlrd in the final script, I would have to hard-code a lot of formatting because of various type issues (e.g. ints read as floats, dates read as ints, datetimes read as floats). Since I need to reuse this code for a few more imports, it doesn't make sense to hard-code the formatting for specific columns and then maintain similar code across 4 different scripts.
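For reference, the kind of per-cell normalisation described above might be sketched as follows. This is only an illustration under stated assumptions: `normalize_cell` and `EXCEL_EPOCH_1900` are hypothetical names, the cell type codes mirror xlrd's `XL_CELL_NUMBER` and `XL_CELL_DATE` constants, and the date conversion assumes the workbook uses the 1900 date system with dates after February 1900. In real code, `xlrd.xldate_as_datetime(value, workbook.datemode)` covers both date systems.

```python
from datetime import datetime, timedelta

# Cell type codes with the same values as xlrd.XL_CELL_NUMBER / xlrd.XL_CELL_DATE.
XL_CELL_NUMBER = 2
XL_CELL_DATE = 3

# Assumption: 1900 date system (workbook.datemode == 0), serial dates >= 61.
EXCEL_EPOCH_1900 = datetime(1899, 12, 30)


def normalize_cell(value, ctype):
    """Return a friendlier Python value for a raw xlrd cell value (sketch)."""
    if ctype == XL_CELL_DATE:
        # Excel stores dates as a float number of days since the epoch.
        return EXCEL_EPOCH_1900 + timedelta(days=value)
    if ctype == XL_CELL_NUMBER and float(value).is_integer():
        # xlrd reports every number as float; recover whole numbers as int.
        return int(value)
    if isinstance(value, str):
        return value.strip()
    return value
```

A generic helper like this avoids naming specific columns, since it decides per cell based on the type code that xlrd reports (`worksheet.cell_type(row, col)`).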

Any advice on how to proceed?

221

asked Mar 06 '16 06:03

Ron Johnson

1 Answer

You can just iterate over the sheet:

    import openpyxl


    def UseOpenpyxl(file_name):
        wb = openpyxl.load_workbook(file_name, read_only=True)
        sheet = wb.active
        # Iterate over the rows once instead of looking up each cell individually.
        rows = sheet.rows
        # The first row holds the column headers.
        first_row = [cell.value for cell in next(rows)]
        data = []
        for row in rows:
            record = {}
            for key, cell in zip(first_row, row):
                # data_type 's' marks a string cell.
                if cell.data_type == 's':
                    record[key] = cell.value.strip()
                else:
                    record[key] = cell.value
            data.append(record)
        return data
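One plausible reading of the timings in the question (an assumption on my part, not something measured here) is that in read-only mode every individual `sheet.cell()` lookup has to scan the worksheet again from the top, so the total work grows roughly quadratically with the number of rows. The toy model below does not call openpyxl at all; `make_rows`, `read_by_cell` and `read_by_iteration` are hypothetical stand-ins that only count how many rows a forward-only parser would have to produce under each access pattern.

```python
def make_rows(n_rows, n_cols):
    """Stand-in for a forward-only worksheet parser (hypothetical)."""
    for r in range(n_rows):
        yield [r * n_cols + c for c in range(n_cols)]


def read_by_cell(n_rows, n_cols):
    """Fetch every cell individually, rescanning from the top each time."""
    rows_parsed = 0
    data = []
    for r in range(n_rows):
        record = []
        for c in range(n_cols):
            for i, row in enumerate(make_rows(n_rows, n_cols)):
                rows_parsed += 1
                if i == r:
                    record.append(row[c])
                    break
        data.append(record)
    return data, rows_parsed


def read_by_iteration(n_rows, n_cols):
    """Walk the rows once, front to back."""
    rows_parsed = 0
    data = []
    for row in make_rows(n_rows, n_cols):
        rows_parsed += 1
        data.append(list(row))
    return data, rows_parsed


slow_data, slow_cost = read_by_cell(100, 5)
fast_data, fast_cost = read_by_iteration(100, 5)
print(slow_data == fast_data, slow_cost, fast_cost)
```

Both functions return the same data, but the per-cell version parses far more rows in total, and the gap widens as the sheet grows.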

This should scale to large files. You may want to chunk your result if the list data gets too large.
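The chunking idea could look roughly like this. It is a minimal sketch: `iter_chunks` and `chunk_size` are names made up for illustration, and `fake_records` stands in for a generator that yields one record per worksheet row.

```python
from itertools import islice


def iter_chunks(records, chunk_size):
    """Yield lists of at most chunk_size items from any iterable."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


# Stand-in for a generator that yields one record per worksheet row.
fake_records = ({"id": i} for i in range(5))

for chunk in iter_chunks(fake_records, 2):
    # In a real import, each chunk would be written to the database here.
    print(len(chunk), chunk[0]["id"])
```

To use this with the function above, you would turn `UseOpenpyxl` into a generator (`yield record` instead of `data.append(record)`), so that only one chunk of records is held in memory at a time.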

Now the openpyxl version takes about twice as long as the xlrd one:

    %timeit xlrd_results = UseXlrd('foo.xlsx')
    1 loops, best of 3: 3.38 s per loop

    %timeit openpyxl_results = UseOpenpyxl('foo.xlsx')
    1 loops, best of 3: 6.87 s per loop

Note that xlrd and openpyxl might interpret what is an integer and what is a float slightly differently. For my test data, I needed to add float() to make the outputs comparable:

    def UseOpenpyxl(file_name):
        wb = openpyxl.load_workbook(file_name, read_only=True)
        sheet = wb.active
        rows = sheet.rows
        # float() on the header row only works when the header cells are numeric,
        # as they are in this test file.
        first_row = [float(cell.value) for cell in next(rows)]
        data = []
        for row in rows:
            record = {}
            for key, cell in zip(first_row, row):
                if cell.data_type == 's':
                    record[key] = cell.value.strip()
                else:
                    # Coerce every non-string value to float to match xlrd's output.
                    record[key] = float(cell.value)
            data.append(record)
        return data

Now, both versions give the same results for my test data:

    >>> xlrd_results == openpyxl_results
    True

124

answered Sep 20 '22 22:09

Mike Müller
