I have a fairly large MySQL table: about 30M rows, 6 columns, roughly 2 GB when loaded into memory.
I work with both Python and R. In R, I can load the table into memory in about 90 seconds, but in Python it takes 40 minutes.
I've tried it both with sqlalchemy (using the pymysql driver) and with plain mysql.connector. The code is straightforward. For example, with sqlalchemy:
import sqlalchemy
import pandas as pd

db_engine = sqlalchemy.create_engine("mysql+pymysql://user:pass@host/database")
cnx = db_engine.connect()
table = pd.read_sql('select * from my_table', cnx)
cnx.close()
Without sqlalchemy:
import mysql.connector
import pandas as pd

cnx = mysql.connector.connect(**db_details)
cursor = cnx.cursor()
cursor.execute('select * from my_table')
# materialize every row in Python before building the DataFrame
table = pd.DataFrame(data=list(cursor), columns=cursor.column_names)
cnx.close()
Either way, it is much, much slower than R, which doesn't make much sense to me. Why is this, and is there any way to speed it up? Even a hack would do.
To add: pandas doesn't have anything to do with it taking so long. In the second code snippet, if I just return list(cursor) instead of putting it into a pandas DataFrame, it takes (basically) just as long.
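A minimal sketch for checking this, timing the raw fetch separately from the DataFrame construction (db_details is the same placeholder dict as above):

import time
import mysql.connector
import pandas as pd

cnx = mysql.connector.connect(**db_details)
cursor = cnx.cursor()
cursor.execute('select * from my_table')

t0 = time.time()
rows = list(cursor)  # raw fetch over the connector
t1 = time.time()
table = pd.DataFrame(data=rows, columns=cursor.column_names)
t2 = time.time()
print('fetch: %.1fs, DataFrame build: %.1fs' % (t1 - t0, t2 - t1))

cnx.close()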
On edit: The DB is running on the same machine as R/Python, so everything should be identical in terms of throughput.
In R I am using DBI, and the R code I use is (basically) this:
require(DBI)  # dbDriver("MySQL") dispatches to the RMySQL backend
cnx <- dbConnect(dbDriver("MySQL"), dbname="database", username="user", password="pass", host="host")
table <- dbGetQuery(cnx, "select * from my_table")
******** RESOLVED (MOSTLY) ********
Thanks to helpful comments, particularly from @roganjosh, it appears that the issue is that the default MySQL connector is written in Python rather than C, which makes it very slow. The solution is to use MySQLdb, which is a native C connector.

In my particular setup, running Python 3 with Anaconda, that wasn't possible because MySQLdb is only supported in Python 2. However, there is an implementation of MySQLdb for Python 3 under the name mysqlclient.

Using this implementation, the time is now down to about 5 minutes to read the whole table. Not as fast as R, but much less than the 40 or so minutes it was taking before.
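For reference, a minimal sketch of the working setup with mysqlclient (installed via, e.g., pip install mysqlclient; connection details are placeholders as above). SQLAlchemy selects the C driver through the mysqldb dialect name:

import sqlalchemy
import pandas as pd

# "mysql+mysqldb" routes through mysqlclient (the C-based MySQLdb fork)
db_engine = sqlalchemy.create_engine("mysql+mysqldb://user:pass@host/database")
with db_engine.connect() as cnx:
    table = pd.read_sql('select * from my_table', cnx)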
I'm still open to suggestions that would make it faster, but my guess is that this is as good as it's going to get.
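One hack sometimes suggested for bulk reads like this, untested here: have MySQL dump the table to CSV server-side and parse it with pandas' C reader. It assumes the MySQL user has the FILE privilege, that secure_file_priv permits writing to the target directory, and (as in this setup) that the DB and Python run on the same machine so the file is locally readable; the column names below are placeholders:

import mysql.connector
import pandas as pd

cnx = mysql.connector.connect(**db_details)
cursor = cnx.cursor()
# INTO OUTFILE writes on the server and fails if the file already exists
cursor.execute(
    "SELECT * FROM my_table INTO OUTFILE '/tmp/my_table.csv' "
    "FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n'"
)
cnx.close()

# No header row is written, so column names must be supplied (placeholders here)
table = pd.read_csv('/tmp/my_table.csv', header=None,
                    names=['col1', 'col2', 'col3', 'col4', 'col5', 'col6'])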