Slow loading SQL Server table into pandas DataFrame

Tags: python, sql-server, pandas, pyodbc

pandas becomes extremely slow when loading more than 10 million records from a SQL Server database using pyodbc, mainly through the function pandas.read_sql(query, pyodbc_conn). The following code takes 40-45 minutes to load 10-15 million records from the SQL table Table1.

Is there a better, faster way to read a SQL table into a pandas DataFrame?

import pyodbc
import pandas

# connection details (placeholders)
server = '<server_ip>'
database = '<db_name>'
username = '<db_user>'
password = '<password>'
port = '1443'
conn = pyodbc.connect('DRIVER={SQL Server};SERVER=' + server + ';PORT=' + port + ';DATABASE=' + database + ';UID=' + username + ';PWD=' + password)
cursor = conn.cursor()

data = pandas.read_sql("select * from Table1", conn)  # Takes about 40-45 minutes to complete

813

asked Nov 19 '18 21:11

Anjana Shivangi

1 Answer

I had the same problem with an even larger number of rows (~50 million). I ended up reading the result of the SQL query in chunks and storing it in an .h5 (HDF5) file.

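import pandas as pd

# `con` is an already-open database connection (for example, the pyodbc connection from the question)
# with chunksize set, read_sql returns an iterator of DataFrames of up to 100,000 rows each
sql_reader = pd.read_sql("select * from table_a", con, chunksize=10**5)

hdf_fn = '/path/to/result.h5'
hdf_key = 'my_huge_df'
store = pd.HDFStore(hdf_fn)
cols_to_index = [<LIST OF COLUMNS THAT WE WANT TO INDEX in HDF5 FILE>]

# append each chunk to the same table; index=False defers index creation until all rows are written
for chunk in sql_reader:
    store.append(hdf_key, chunk, data_columns=cols_to_index, index=False)

# index data columns in HDFStore
store.create_table_index(hdf_key, columns=cols_to_index, optlevel=9, kind='full')
store.close()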
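The loop above relies on pd.read_sql returning an iterator of DataFrames when chunksize is given, instead of one large DataFrame. Here is a minimal sketch of that behaviour that can be run without a SQL Server instance. It assumes an in-memory SQLite database as a stand-in for the real connection; the table contents and the chunk size of 300 are made up for illustration.

```python
import sqlite3

import pandas as pd

# Stand-in for the real SQL Server connection (assumption: SQLite in memory).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE table_a (id INTEGER, value REAL)")
con.executemany(
    "INSERT INTO table_a VALUES (?, ?)",
    [(i, i * 0.5) for i in range(1000)],  # 1000 illustrative rows
)
con.commit()

chunk_sizes = []
total_rows = 0
# With chunksize set, each iteration yields a DataFrame of at most 300 rows.
for chunk in pd.read_sql("select * from table_a", con, chunksize=300):
    chunk_sizes.append(len(chunk))
    total_rows += len(chunk)

con.close()

print(chunk_sizes)
print(total_rows)
```

Each chunk can be processed or written out before the next one is fetched, which is what lets the answer append to the HDF5 file piece by piece.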

Once the data is stored in the HDF5 file, reading it back is faster than pandas.read_csv.
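For reading the stored data back, a minimal sketch might look like the following. It assumes the PyTables package (`tables`) is installed, which HDFStore needs anyway. The temporary file, the small sample frame, and the `id` column are hypothetical stand-ins for the real file and the columns listed in cols_to_index.

```python
import os
import tempfile

import pandas as pd

# Hypothetical small frame standing in for the rows loaded from SQL.
df = pd.DataFrame({"id": range(1000), "value": [i * 0.5 for i in range(1000)]})

# Temporary path used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), "result.h5")

# Write in table format with "id" as an indexed data column, as in the answer.
with pd.HDFStore(path, mode="w") as store:
    store.append("my_huge_df", df, data_columns=["id"], index=False)
    store.create_table_index("my_huge_df", columns=["id"], optlevel=9, kind="full")

# Read the whole table back.
full = pd.read_hdf(path, "my_huge_df")

# Read only the rows matching a condition on an indexed data column.
subset = pd.read_hdf(path, "my_huge_df", where="id >= 990")

print(len(full), len(subset))
```

The `where` filter only works on columns that were passed as data_columns when the table was written, which is why the answer lists them up front.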

103

answered Oct 22 '22 15:10

Sai Praneeth
