
Fastest way to read huge MySQL table in python

I was trying to read a very large MySQL table of several million rows. I used the Pandas library with chunks. See the code below:

import pandas as pd
import pymysql

connection = pymysql.connect(user='xxx', password='xxx', database='xxx', host='xxx')

try:
    query = "SELECT * FROM example_table;"

    chunks = []

    # read the table in chunks of 1000 rows, then stitch them back together
    for chunk in pd.read_sql(query, connection, chunksize=1000):
        chunks.append(chunk)
    #print(len(chunks))
    result = pd.concat(chunks, ignore_index=True)
    #print(type(result))
    #print(result)

finally:
    print("Done!")

    connection.close()

The execution time is acceptable if I limit the number of rows selected. But if I want to select even a modest amount of data (for example, 1 million rows), the execution time increases dramatically.

Is there a better/faster way to select data from a relational database within Python?

asked Jul 03 '18 by UgoL


1 Answer

Another option might be to use the multiprocessing module, dividing the query up and sending it to multiple parallel processes, then concatenating the results.

Without knowing much about pandas chunking, I think you would have to do the chunking manually (how best to do that depends on the data). Don't use LIMIT / OFFSET for the chunks: performance would be terrible, because the database still has to scan and throw away every row before each offset.
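
For the manual chunking, one pattern that avoids OFFSET entirely is keyset pagination: seek past the last value of an indexed column on each iteration. A minimal sketch, assuming a hypothetical auto-increment primary key id (the question doesn't say whether the table has one):

import pandas as pd
import MySQLdb

def read_in_chunks(connection, chunk_size=100000):
    # keyset pagination: seek past the last id seen instead of using OFFSET
    chunks = []
    last_id = 0
    while True:
        query = ("SELECT * FROM example_table "
                 "WHERE id > {0} ORDER BY id LIMIT {1}").format(last_id, chunk_size)
        chunk = pd.read_sql(query, connection)
        if chunk.empty:
            break
        chunks.append(chunk)
        last_id = chunk["id"].iloc[-1]  # resume from the last row read
    return pd.concat(chunks, ignore_index=True)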

This might not be a good idea, depending on the data. But if there is a useful way to split up the query (e.g. if it's a timeseries, or if there is some kind of appropriate indexed column to use), it can make sense. I've put in two examples below to show the different cases.

Example 1

import multiprocessing

import pandas as pd
import MySQLdb

def worker(y):
    # y is a value in an indexed column, e.g. a category
    connection = MySQLdb.connect(user='xxx', password='xxx', database='xxx', host='xxx')
    # if col_x holds strings, quote the value or use a parameterized query
    query = "SELECT * FROM example_table WHERE col_x = {0}".format(y)
    return pd.read_sql(query, connection)

p = multiprocessing.Pool(processes=10)
#(or however many processes you want to allocate)

data = p.map(worker, [y for y in col_x_categories])
#assuming there is a reasonable number of categories in an indexed col_x

p.close()
results = pd.concat(data)
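
Here col_x_categories is a placeholder for the list of distinct values in col_x (e.g. the result of SELECT DISTINCT col_x FROM example_table). Each worker opens its own connection, because database connections can't be shared across processes. (On platforms that spawn rather than fork processes, e.g. Windows, the Pool code also needs to sit under an if __name__ == "__main__": guard.)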

Example 2

import multiprocessing

import pandas as pd
import MySQLdb

def worker(a, b):
    # a and b are timestamps bounding one slice of the table
    connection = MySQLdb.connect(user='xxx', password='xxx', database='xxx', host='xxx')
    # timestamps need quoting inside the SQL string
    query = "SELECT * FROM example_table WHERE x >= '{0}' AND x < '{1}'".format(a, b)
    return pd.read_sql(query, connection)

p = multiprocessing.Pool(processes=10)
#(or however many processes you want to allocate)

date_range = pd.date_range(start=d1, end=d2, freq="A-JAN")
# this is arbitrary and will depend on your data / knowing your data beforehand
# (i.e. d1, d2 and an appropriate freq to use)

date_pairs = list(zip(date_range, date_range[1:]))
data = p.starmap(worker, date_pairs)  # starmap unpacks each (a, b) pair

p.close()
results = pd.concat(data)
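
Here d1 and d2 are placeholders for the start and end of the date range. If they aren't known up front, one way to get them (a sketch, assuming x is the indexed datetime column used above) is to query the bounds first:

connection = MySQLdb.connect(user='xxx', password='xxx', database='xxx', host='xxx')
bounds = pd.read_sql("SELECT MIN(x) AS d1, MAX(x) AS d2 FROM example_table", connection)
connection.close()
d1, d2 = bounds["d1"].iloc[0], bounds["d2"].iloc[0]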

There are probably nicer ways of doing this (and I haven't properly tested it, etc.). I'd be interested to know how it goes if you try it.

answered Sep 21 '22 by djmac