 

Fastest way to fetch table from MySQL into Pandas

I am trying to determine the fastest way to fetch data from MySQL into Pandas. So far, I have tried three different approaches:

Approach 1: Using pymysql and modifying field type (inspired by Fastest way to load numeric data into python/pandas/numpy array from MySQL)

import pymysql
from pymysql.converters import conversions
from pymysql.constants import FIELD_TYPE

# Return DECIMAL columns as float instead of decimal.Decimal objects
conversions[FIELD_TYPE.DECIMAL] = float
conversions[FIELD_TYPE.NEWDECIMAL] = float
conn = pymysql.connect(host=host, port=port, user=user, passwd=passwd, db=db)

Approach 2: Using MySQLdb

import MySQLdb
from MySQLdb.converters import conversions
from MySQLdb.constants import FIELD_TYPE

# Return DECIMAL columns as float instead of decimal.Decimal objects
conversions[FIELD_TYPE.DECIMAL] = float
conversions[FIELD_TYPE.NEWDECIMAL] = float
conn = MySQLdb.connect(host=host, port=port, user=user, passwd=passwd, db=db)

Approach 3: Using SQLAlchemy

import sqlalchemy as SQL
engine = SQL.create_engine('mysql+mysqldb://{0}:{1}@{2}:{3}/{4}'.format(user, passwd, host, port, db))
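For completeness, the engine from Approach 3 is typically handed to pandas for the actual fetch. A minimal sketch; the credentials and table name below are placeholders, and the commented lines additionally require SQLAlchemy, mysqlclient, and a reachable server:

```python
# Sketch: wiring the Approach 3 engine into pandas (placeholder credentials).
user, passwd, host, port, db = 'user', 'secret', 'localhost', 3306, 'mydb'
url = 'mysql+mysqldb://{0}:{1}@{2}:{3}/{4}'.format(user, passwd, host, port, db)

# import pandas as pd
# import sqlalchemy as SQL
# engine = SQL.create_engine(url)
# df = pd.read_sql('SELECT * FROM my_table', con=engine)  # hypothetical table
```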

Approach 2 is the fastest of the three and takes an average of 4 seconds to fetch my table. However, the same fetch takes only 2 seconds in MySQL Workbench. How can I shave off these 2 extra seconds? Does anyone know of an alternative way to accomplish this?

asked Jun 19 '17 by Sandeep Silwal

2 Answers

You can use the ConnectorX library, which is written in Rust and is roughly 10 times faster than pandas' own SQL readers. It fetches the data from the database and fills the DataFrame directly.
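A minimal sketch of the ConnectorX call (requires `pip install connectorx`; the connection string and table name are placeholders, and the commented lines need a reachable MySQL server):

```python
# Sketch: fetching a whole table with ConnectorX (placeholder credentials).
conn = 'mysql://user:secret@localhost:3306/mydb'

# import connectorx as cx
# df = cx.read_sql(conn, 'SELECT * FROM my_table')
# ConnectorX can also split the query across workers on a numeric column:
# df = cx.read_sql(conn, 'SELECT * FROM my_table',
#                  partition_on='id', partition_num=4)
```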

answered Oct 02 '22 by DrMorteza

I think you may find answers using a specific library such as peewee, or the pandas function pd.read_sql_query. To use pd.read_sql_query:

import pandas as pd
from sqlalchemy import create_engine

MyEngine = create_engine('[YourDatabase]://[User]:[Pass]@[Host]/[DatabaseName]', echo=True)
df = pd.read_sql_query('SELECT * FROM [TableName]', con=MyEngine)

Also, for uploading data from a dataframe to SQL:

df.to_sql('[TableName]', MyEngine, if_exists='append', index=False)

You must pass if_exists='append' if the table already exists, because the default is 'fail', which raises an error. You can also pass 'replace' if you want to drop the existing table and write a new one in its place.
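The if_exists behavior is easy to see locally; a sketch using an in-memory SQLite connection, which pandas accepts directly, just to stand in for MySQL here:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(':memory:')
df = pd.DataFrame({'a': [1, 2]})

df.to_sql('demo', conn, index=False)                        # creates the table
df.to_sql('demo', conn, if_exists='append', index=False)    # adds 2 more rows
out = pd.read_sql_query('SELECT COUNT(*) AS n FROM demo', conn)

df.to_sql('demo', conn, if_exists='replace', index=False)   # drop and recreate
out2 = pd.read_sql_query('SELECT COUNT(*) AS n FROM demo', conn)
```

After the append the table holds 4 rows; after the replace it is back to the 2 rows of the DataFrame.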

For data integrity's sake, DataFrames are a nice vehicle for uploads and downloads because they handle typed data well. Depending on the size of your upload, they should be reasonably efficient on upload time too.

If you want to go an extra step, peewee queries may help make upload time faster, although I have not personally benchmarked it. Peewee is an ORM library like SQLAlchemy that I found very easy and expressive to develop with, and you can use DataFrames with it as well. Skimming the documentation, you would construct a query and then convert it to a DataFrame like this:

MyQuery = [TableName].select().where([TableName.column] == "value")
df = pd.DataFrame(list(MyQuery.dicts()))

Hope this helps.

answered Oct 02 '22 by Azarias