Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fastest way to read .xlsx file with Python

I'm trying to read data from a .xlsx file into a MySQL database using Python.

Here's my code:

wb = openpyxl.load_workbook(filename="file", read_only=True)
ws = wb['My Worksheet']

conn = MySQLdb.connect()
cursor = conn.cursor()

cursor.execute("SET autocommit = 0")

for row in ws.iter_rows(row_offset=1):
     sql_row = # data i need
     cursor.execute("INSERT sql_row")

conn.commit() 

Unfortunately, openpyxl's ws.iter_rows() is painfully slow. I've tried similar methods using the xlrd and pandas modules. Still slow. Any thoughts?

like image 985
Brinley Avatar asked Jun 26 '17 17:06

Brinley


Video Answer


1 Answers

You really need to benchmark your code and provide information about the size of the worksheet and the time taken to process it.

openpyxl's read-only mode is essentially a memory optimisation which avoids loading the whole worksheet into memory. When it comes to parsing Excel worksheets most of the work involved is converting XML to Python and there are limits to this.

However, two optimisations do spring to mind:

  • keep your SQL statement outside the loop
  • use executemany to pass lots of rows at once to the driver

These can be combined in something like like

INSERT_SQL = "INSERT INTO mytable (name, age…) VALUES (%s, %s, …)"
c.executemany(INSERT_SQL, ws.values)

If you only want a subset of the rows then look at using itertools.islice

This should be faster than your current code but you shouldn't expect miracles.

When it comes to pure performance, xlrd is a little faster than openpyxl when reading worksheets because it has a smaller memory footprint, largely related to being a read-only library. But it always loads a whole workbook into memory, which might not be want you want.

like image 186
Charlie Clark Avatar answered Oct 20 '22 18:10

Charlie Clark