Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Slow loading SQL Server table into pandas DataFrame

Pandas gets ridiculously slow when loading more than 10 million records from a SQL Server DB using pyodbc and mainly the function pandas.read_sql(query,pyodbc_conn). The following code takes up to 40-45 minutes to load 10-15 million records from SQL table: Table1

Is there a better and faster method to read SQL Table into pandas Dataframe?

import pyodbc
import pandas

server = <server_ip> 
database = <db_name> 
username = <db_user> 
password = <password> 
port='1443'
conn = pyodbc.connect('DRIVER={SQL Server};SERVER='+server+';PORT='+port+';DATABASE='+database+';UID='+username+';PWD='+ password)
cursor = conn.cursor()

data = pandas.read_sql("select * from Table1", conn) #Takes about 40-45 minutes to complete
like image 813
Anjana Shivangi Avatar asked Nov 19 '18 21:11

Anjana Shivangi


People also ask

Is pandas read SQL slow?

pandas read_sql is unusually slow.

Is Panda faster than SQL?

This main difference can mean that the two tools are separate, however, you can also perform several of the same functions in each respective tool, for example, you can create new features from existing columns in pandas, perhaps easier and faster than in SQL.

Is pandas Read_sql fast?

Reading SQL queries into Pandas dataframes is a common task, and one that can be very slow. Depending on the database being used, this may be hard to get around, but for those of us using Postgres we can speed this up considerably using the COPY command.


1 Answers

I had a same problem with even more number of rows, ~50 M Ended up writing a SQL query and stored them as .h5 files.

sql_reader = pd.read_sql("select * from table_a", con, chunksize=10**5)

hdf_fn = '/path/to/result.h5'
hdf_key = 'my_huge_df'
store = pd.HDFStore(hdf_fn)
cols_to_index = [<LIST OF COLUMNS THAT WE WANT TO INDEX in HDF5 FILE>]

for chunk in sql_reader:
    store.append(hdf_key, chunk, data_columns=cols_to_index, index=False)

# index data columns in HDFStore
store.create_table_index(hdf_key, columns=cols_to_index, optlevel=9, kind='full')
store.close()

This way, we'll be able to read them faster than a Pandas.read_csv

like image 103
Sai Praneeth Avatar answered Oct 22 '22 15:10

Sai Praneeth