I want to read some numeric (double, i.e. float64) data from a MySQL table. The size of the data is ~200k rows.
MATLAB reference:
tic;
feature accel off;
conn = database(...);
c=fetch(exec(conn,'select x,y from TABLENAME'));
cell2mat(c.data);
toc
Elapsed time is ~1 second.
Doing the same in Python, using the several examples found here (I have tried them all, i.e. pandas read_frame, frame_query, and the __processCursor function): How to convert SQL Query result to PANDAS Data Structure?
Reference Python code:
import pyodbc
import pandas.io.sql as psql
import pandas
connection_info = "DRIVER={MySQL ODBC 3.51 Driver};SERVER=;DATABASE=;USER=;PASSWORD=;OPTION=3;"
cnxn = pyodbc.connect(connection_info)
cursor = cnxn.cursor()
sql = "select x,y from TABLENAME"
#cursor.execute(sql)
#dataframe = __processCursor(cursor, dataframe=True)
#df = psql.frame_query(sql, cnxn, coerce_float=False)
df = psql.read_frame(sql, cnxn)
cnxn.close()
Takes ~6 seconds. The profiler says all the time was spent in read_frame. I was wondering if anyone could give me some hints on how this could be sped up to at least match the MATLAB code, and whether that is possible at all in Python.
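(Aside for later readers: read_frame and frame_query belong to an old pandas.io.sql API; current pandas exposes the same functionality as pandas.read_sql. A minimal sketch, assuming a SQLAlchemy engine with placeholder credentials:)
import pandas
from sqlalchemy import create_engine

# placeholder connection URL -- substitute your own host, user, password and database
engine = create_engine("mysql+pymysql://user:password@host/dbname")
df = pandas.read_sql("select x,y from TABLENAME", engine)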
EDIT:
The bottleneck seems to be inside cursor.execute (in the pymysql library) or cursor.fetchall() (in the pyodbc library). The slowest part is reading the returned MySQL data element by element (row by row, column by column) and converting it to the data type that the library inferred earlier.
So far I have managed to speed this up to close to MATLAB speed with this really dirty solution:
import pymysql
import numpy
conn = pymysql.connect(host='', port=, user='', passwd='', db='')
cursor = conn.cursor()
cursor.execute("select x,y from TABLENAME")
rez = cursor.fetchall()  # with the patched driver described below, this is a flat list of strings
resarray = numpy.array(list(map(float, rez)))  # bulk-convert the strings to float64
finalres = resarray.reshape((-1, 2))  # back to two columns: x, y
The cursor.execute above IS NOT the stock pymysql execute! I have modified it inside the file "connections.py". In the function _read_rowdata_packet, the original line
rows.append(self._read_row_from_packet(packet))
is replaced with
self._read_string_from_packet(rows,packet)
Here _read_string_from_packet is a simplified version of _read_row_from_packet with the code:
def _read_string_from_packet(self, rows, packet):
    for field in self.fields:
        # append the raw string for every field; skip per-field type conversion
        data = packet.read_length_coded_string()
        rows.append(data)
This is an uber-dirty solution, but it brings the time down from 6 seconds to 2.5 seconds. I was wondering if all of this could somehow be avoided by using a different library or by passing some parameters?
Hence the solution would be to bulk-read the entire MySQL reply into a list of strings and then bulk-convert it to numeric data types, instead of doing that element by element. Does something like that already exist in Python?
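(For illustration of the bulk-convert idea with an unmodified driver, assuming the x and y columns are stored as DOUBLE, one can at least let NumPy convert the whole result in a single call after fetchall(); the per-element decoding inside the driver is of course still paid:)
import numpy
import pymysql

# placeholder credentials -- fill in your own connection details
conn = pymysql.connect(host='', port=3306, user='', passwd='', db='')
cursor = conn.cursor()
cursor.execute("select x,y from TABLENAME")
rows = cursor.fetchall()                         # tuple of (x, y) tuples
data = numpy.asarray(rows, dtype=numpy.float64)  # one bulk conversion, shape (n, 2)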
Also check out this way of doing things using the turbodbc package. To transform your result set into an OrderedDict of NumPy arrays, just do this:
import turbodbc
connection = turbodbc.connect(dsn="My data source name")
cursor = connection.cursor()
cursor.execute("SELECT 42")
results = cursor.fetchallnumpy()
Transforming these results into a dataset (e.g. a pandas DataFrame) should require only a few additional milliseconds. I don't know the speedup for MySQL, but I have seen a factor of 10 for other databases.
The speedup is mainly achieved by using bulk operations instead of row-wise operations.
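If a pandas DataFrame is what you ultimately want, the OrderedDict of NumPy arrays returned by fetchallnumpy() can be wrapped directly (a minimal sketch, assuming the same x,y query as above):
import pandas
import turbodbc

connection = turbodbc.connect(dsn="My data source name")
cursor = connection.cursor()
cursor.execute("select x,y from TABLENAME")
df = pandas.DataFrame(cursor.fetchallnumpy())  # column name -> NumPy array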