I'm trying to read data from a .xlsx file into a MySQL database using Python.
Here's my code:
import openpyxl
import MySQLdb

wb = openpyxl.load_workbook(filename="file", read_only=True)
ws = wb['My Worksheet']
conn = MySQLdb.connect()
cursor = conn.cursor()
cursor.execute("SET autocommit = 0")
for row in ws.iter_rows(row_offset=1):
    sql_row = ...  # the data I need from each row
    cursor.execute("INSERT ...", sql_row)  # parameterized INSERT statement
conn.commit()
Unfortunately, openpyxl's ws.iter_rows() is painfully slow. I've tried similar methods using the xlrd and pandas modules. Still slow. Any thoughts?
You really need to benchmark your code and provide information about the size of the worksheet and the time taken to process it.
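For example, a minimal timing sketch like this (the filename and sheet name here are placeholders) would show how much of the time goes into just parsing the file:

import time

import openpyxl

start = time.perf_counter()
wb = openpyxl.load_workbook(filename="big_file.xlsx", read_only=True)
ws = wb['My Worksheet']
n_rows = sum(1 for _ in ws.iter_rows())  # iterate without doing any other work
print(f"Parsed {n_rows} rows in {time.perf_counter() - start:.2f}s")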
openpyxl's read-only mode is essentially a memory optimisation which avoids loading the whole worksheet into memory. When it comes to parsing Excel worksheets, most of the work involved is converting XML to Python, and there are limits to this.
However, two optimisations do spring to mind:

- keep the SQL statement outside the loop
- use executemany to pass lots of rows at once to the driver

These can be combined in something like
INSERT_SQL = "INSERT INTO mytable (name, age…) VALUES (%s, %s, …)"
c.executemany(INSERT_SQL, ws.values)
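A fuller sketch of that idea, assuming a hypothetical mytable with name and age columns (connection parameters omitted):

import MySQLdb
import openpyxl

# Hypothetical table and columns; adjust to match your schema.
INSERT_SQL = "INSERT INTO mytable (name, age) VALUES (%s, %s)"

wb = openpyxl.load_workbook(filename="file.xlsx", read_only=True)
ws = wb['My Worksheet']

conn = MySQLdb.connect()  # your connection parameters here
cursor = conn.cursor()

# ws.values yields one tuple of cell values per row.
# Note: this includes the header row; see the islice note below.
cursor.executemany(INSERT_SQL, ws.values)
conn.commit()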
If you only want a subset of the rows, look at using itertools.islice.
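For example, to skip the header row (a sketch reusing the names above):

from itertools import islice

# Skip the first (header) row; pass everything after it to the driver.
cursor.executemany(INSERT_SQL, islice(ws.values, 1, None))
conn.commit()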
This should be faster than your current code but you shouldn't expect miracles.
When it comes to pure performance, xlrd is a little faster than openpyxl when reading worksheets because it has a smaller memory footprint, largely related to being a read-only library. But it always loads a whole workbook into memory, which might not be what you want.
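For comparison, a minimal xlrd sketch (note that xlrd 2.0+ only reads legacy .xls files, so an older version would be needed for .xlsx):

import xlrd

# xlrd parses the entire workbook up front.
book = xlrd.open_workbook("file.xls")
sheet = book.sheet_by_name('My Worksheet')
for rx in range(1, sheet.nrows):  # skip the header row
    values = sheet.row_values(rx)  # list of cell values for this row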