
Fastest way to read huge MySQL table in python

I was trying to read a very large MySQL table of several million rows. I used the Pandas library with chunks. See the code below:

import pandas as pd
import pymysql

connection = pymysql.connect(user='xxx', password='xxx', database='xxx', host='xxx')

try:
    query = "SELECT * FROM example_table;"

    chunks = []

    # read the table in chunks of 1000 rows, then stitch them back together
    for chunk in pd.read_sql(query, connection, chunksize=1000):
        chunks.append(chunk)
    #print(len(chunks))
    result = pd.concat(chunks, ignore_index=True)
    #print(type(result))
    #print(result)

finally:
    print("Done!")

    connection.close()

The execution time is acceptable if I limit the number of rows selected. But if I want to select even a modest amount of data (for example, 1 million rows), the execution time increases dramatically.

Is there a better/faster way to select data from a relational database within Python?

asked Jul 03 '18 by UgoL


1 Answer

Another option might be to use the multiprocessing module, dividing the query up and sending it to multiple parallel processes, then concatenating the results.

Without knowing much about pandas chunking, I think you would have to do the chunking manually (how best to do that depends on the data). Don't use LIMIT / OFFSET for the chunks: performance would be terrible, because the database still has to scan and throw away every row before each offset.
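
For the manual chunking, one pattern that avoids OFFSET entirely is keyset pagination: seek past the last value of an indexed column on each iteration. A minimal sketch, assuming a hypothetical auto-increment primary key id (the question doesn't say whether the table has one):

import pandas as pd
import MySQLdb

def read_in_chunks(connection, chunk_size=100000):
    # keyset pagination: seek past the last id seen instead of using OFFSET
    chunks = []
    last_id = 0
    while True:
        query = ("SELECT * FROM example_table "
                 "WHERE id > {0} ORDER BY id LIMIT {1}").format(last_id, chunk_size)
        chunk = pd.read_sql(query, connection)
        if chunk.empty:
            break
        chunks.append(chunk)
        last_id = chunk["id"].iloc[-1]  # resume from the last row read
    return pd.concat(chunks, ignore_index=True)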

This might not be a good idea, depending on the data. But if there is a useful way to split up the query (e.g. if it's a timeseries, or if there is some kind of appropriate indexed column to use), it can make sense. I've put in two examples below to show the different cases.

Example 1

import multiprocessing

import pandas as pd
import MySQLdb

def worker(y):
    # y is a value in an indexed column, e.g. a category
    connection = MySQLdb.connect(user='xxx', password='xxx', database='xxx', host='xxx')
    # if col_x holds strings, quote the value or use a parameterized query
    query = "SELECT * FROM example_table WHERE col_x = {0}".format(y)
    return pd.read_sql(query, connection)

p = multiprocessing.Pool(processes=10)
#(or however many processes you want to allocate)

data = p.map(worker, [y for y in col_x_categories])
#assuming there is a reasonable number of categories in an indexed col_x

p.close()
results = pd.concat(data)
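
Here col_x_categories is a placeholder for the list of distinct values in col_x (e.g. the result of SELECT DISTINCT col_x FROM example_table). Each worker opens its own connection, because database connections can't be shared across processes. (On platforms that spawn rather than fork processes, e.g. Windows, the Pool code also needs to sit under an if __name__ == "__main__": guard.)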

Example 2

import multiprocessing

import pandas as pd
import MySQLdb

def worker(a, b):
    # a and b are timestamps bounding one slice of the table
    connection = MySQLdb.connect(user='xxx', password='xxx', database='xxx', host='xxx')
    # timestamps need quoting inside the SQL string
    query = "SELECT * FROM example_table WHERE x >= '{0}' AND x < '{1}'".format(a, b)
    return pd.read_sql(query, connection)

p = multiprocessing.Pool(processes=10)
#(or however many processes you want to allocate)

date_range = pd.date_range(start=d1, end=d2, freq="A-JAN")
# this is arbitrary and will depend on your data / knowing your data beforehand
# (i.e. d1, d2 and an appropriate freq to use)

date_pairs = list(zip(date_range, date_range[1:]))
data = p.starmap(worker, date_pairs)  # starmap unpacks each (a, b) pair

p.close()
results = pd.concat(data)
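
Here d1 and d2 are placeholders for the start and end of the date range. If they aren't known up front, one way to get them (a sketch, assuming x is the indexed datetime column used above) is to query the bounds first:

connection = MySQLdb.connect(user='xxx', password='xxx', database='xxx', host='xxx')
bounds = pd.read_sql("SELECT MIN(x) AS d1, MAX(x) AS d2 FROM example_table", connection)
connection.close()
d1, d2 = bounds["d1"].iloc[0], bounds["d2"].iloc[0]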

There are probably nicer ways of doing this (and I haven't properly tested it, etc.). I'd be interested to know how it goes if you try it.

answered Sep 21 '22 by djmac