
Loading mysql table into python takes a very long time compared to R

Tags: python, mysql, r
I have a fairly large MySQL table: about 30M rows, 6 columns, roughly 2 GB when loaded into memory.

I work with both Python and R. In R, I can load the table into memory in about 90 seconds, but in Python it takes 40 minutes.

I've tried it both with SQLAlchemy and with plain pymysql. The code is straightforward; for example, with SQLAlchemy:

import sqlalchemy
import pandas as pd

db_engine = sqlalchemy.create_engine("mysql+pymysql://user:pass@host/database")
cnx = db_engine.connect()
table = pd.read_sql('select * from my_table', cnx)
cnx.close()

Without sqlalchemy:

import mysql.connector
import pandas as pd

cnx = mysql.connector.connect(**db_details)
cursor = cnx.cursor()
cursor.execute('select * from my_table')
table = pd.DataFrame(data=list(cursor), columns=cursor.column_names)
cnx.close()

Either way, it is much, much slower than R, which doesn't make much sense to me. Why is this, and is there any way to speed it up? Even a hack would do.

To add, pandas doesn't have anything to do with it taking so long. In the second code snippet, if I just return list(cursor) instead of putting it into a pandas DataFrame, it takes (basically) just as long.
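One way to check this yourself is to time the two stages separately. A minimal sketch, using a small generator as a stand-in for the real cursor (the `fake_cursor` function and column names are made up for illustration; real timings obviously depend on your server and driver):

```python
import time
import pandas as pd

def fake_cursor(n_rows=100_000):
    # Stand-in for a DB-API cursor: yields one tuple per row, like iterating a cursor does.
    for i in range(n_rows):
        yield (i, i * 2, f"row{i}")

t0 = time.perf_counter()
rows = list(fake_cursor())  # stage 1: drain the cursor into a list
t1 = time.perf_counter()
table = pd.DataFrame(rows, columns=["a", "b", "c"])  # stage 2: build the DataFrame
t2 = time.perf_counter()

print(f"fetch: {t1 - t0:.3f}s, DataFrame build: {t2 - t1:.3f}s")
```

With a real cursor, stage 1 is where the 40 minutes goes; the DataFrame construction in stage 2 is comparatively negligible.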

On edit: The DB is running on the same machine as R/Python, so everything should be identical in terms of throughput.

In R I am using DBI, and the R code I use is (basically) this:

require(DBI)
require(RMySQL)  # supplies the driver that dbDriver("MySQL") returns
cnx <- dbConnect(dbDriver("MySQL"), dbname="database", username="user", password="pass", host="host")
table <- dbGetQuery(cnx, "select * from my_table")

******** RESOLVED (MOSTLY) ********

Thanks to helpful comments, particularly from @roganjosh, it appears that the issue is that the default MySQL connector is written in pure Python rather than C, which makes it very slow. The solution is to use MySQLdb, which is a native C connector.

In my particular setup, running Python 3 with Anaconda, that wasn't possible because MySQLdb only supports Python 2. However, there is a Python 3 implementation of MySQLdb under the name mysqlclient.

Using this implementation, the time is now down to about 5 minutes to read the whole table: not as fast as R, but much less than the 40 minutes it was taking before.
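For reference, the working version looks roughly like this (a sketch, assuming the same db_details dict and table name as the snippets above; the MySQLdb module comes from installing the mysqlclient package). There is no test harness here since it needs a live MySQL server:

```python
import MySQLdb  # pip install mysqlclient -- the C-based driver
import pandas as pd

# Hypothetical connection details, matching the placeholders used above.
db_details = {"host": "host", "user": "user", "passwd": "pass", "db": "database"}

cnx = MySQLdb.connect(**db_details)
cursor = cnx.cursor()
cursor.execute('select * from my_table')
columns = [desc[0] for desc in cursor.description]  # column names from the cursor metadata
table = pd.DataFrame(data=list(cursor), columns=columns)
cnx.close()
```

If you are going through SQLAlchemy instead, switching the connection URL from mysql+pymysql://... to mysql+mysqldb://... should pick up the same C driver with no other code changes.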

asked Feb 08 '17 by mrip


1 Answer


I'm still open to suggestions that would make it faster, but my guess is that this is as good as it's going to get.

answered Oct 20 '22 by mrip