python pandas to_sql with sqlalchemy: how to speed up exporting to MS SQL?

I have a dataframe with ca. 155,000 rows and 12 columns. If I export it to CSV with dataframe.to_csv, the output is an 11 MB file (which is produced instantly).

If, however, I export to a Microsoft SQL Server with the to_sql method, it takes between 5 and 6 minutes! No columns are text: only int, float, bool and dates. I have seen cases where ODBC drivers set nvarchar(max) and this slows down the data transfer, but it cannot be the case here.

Any suggestions on how to speed up the export process? Taking 6 minutes to export 11 MB of data makes the ODBC connection practically unusable.

Thanks!

My code is:

import pandas as pd
from sqlalchemy import create_engine, MetaData, Table, select

ServerName = "myserver"
Database = "mydatabase"
TableName = "mytable"

engine = create_engine('mssql+pyodbc://' + ServerName + '/' + Database)
conn = engine.connect()

metadata = MetaData(conn)

my_data_frame.to_sql(TableName, engine)
asked Apr 17 '15 by Pythonista anonymous



2 Answers

I recently had the same problem and would like to add an answer for others. to_sql seems to send an INSERT query for every row, which makes it really slow. But since pandas 0.24.0 there is a method parameter in pandas.to_sql() where you can define your own insertion function, or just use method='multi' to tell pandas to pass multiple rows in a single INSERT query, which makes it a lot faster.

Note that your database may have a parameter limit. In that case you also have to define a chunksize.

So the solution should simply look like this:

my_data_frame.to_sql(TableName, engine, chunksize=<yourParameterLimit>, method='multi') 

If you do not know your database's parameter limit, just try it without the chunksize parameter. It will either run or give you an error telling you your limit.
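
If you want to go the "define your own insertion function" route mentioned above, here is a minimal sketch, assuming the pyodbc driver: pyodbc exposes a cursor-level fast_executemany flag (pyodbc 4.0.19+) that batches the parameterized INSERTs instead of sending them row by row. The callable signature (pd_table, conn, keys, data_iter) is what pandas passes to method; the function name and everything inside it are illustrative, not part of pandas.

def insert_with_fast_executemany(pd_table, conn, keys, data_iter):
    # conn is the SQLAlchemy connection pandas hands to the method callable;
    # .connection exposes the underlying pyodbc DBAPI connection
    raw_conn = conn.connection
    columns = ", ".join("[{}]".format(k) for k in keys)
    placeholders = ", ".join("?" for _ in keys)
    sql = "INSERT INTO [{}] ({}) VALUES ({})".format(pd_table.name, columns, placeholders)
    cursor = raw_conn.cursor()
    cursor.fast_executemany = True  # pyodbc-specific: send all rows as one parameter array
    cursor.executemany(sql, list(data_iter))
    cursor.close()

my_data_frame.to_sql(TableName, engine, method=insert_with_fast_executemany)

This only makes sense with pyodbc; with other drivers, stick to method='multi' as shown above.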

answered Sep 19 '22 by NemesisMF


The DataFrame.to_sql method generates insert statements and sends them to your ODBC connector, which then treats them as regular inserts.

When this is slow, it is not the fault of pandas.

Saving the output of the DataFrame.to_sql method to a file, then replaying that file over an ODBC connector will take the same amount of time.

The proper way of bulk importing data into a database is to generate a CSV file and then use a load command, which in the MS flavour of SQL databases is called BULK INSERT.

For example:

BULK INSERT mydatabase.myschema.mytable FROM 'mydatadump.csv'; 

The syntax reference is as follows:

BULK INSERT
    [ database_name . [ schema_name ] . | schema_name . ] [ table_name | view_name ]
    FROM 'data_file'
    [ WITH
      (
        [ [ , ] BATCHSIZE = batch_size ]
        [ [ , ] CHECK_CONSTRAINTS ]
        [ [ , ] CODEPAGE = { 'ACP' | 'OEM' | 'RAW' | 'code_page' } ]
        [ [ , ] DATAFILETYPE = { 'char' | 'native' | 'widechar' | 'widenative' } ]
        [ [ , ] FIELDTERMINATOR = 'field_terminator' ]
        [ [ , ] FIRSTROW = first_row ]
        [ [ , ] FIRE_TRIGGERS ]
        [ [ , ] FORMATFILE = 'format_file_path' ]
        [ [ , ] KEEPIDENTITY ]
        [ [ , ] KEEPNULLS ]
        [ [ , ] KILOBYTES_PER_BATCH = kilobytes_per_batch ]
        [ [ , ] LASTROW = last_row ]
        [ [ , ] MAXERRORS = max_errors ]
        [ [ , ] ORDER ( { column [ ASC | DESC ] } [ ,...n ] ) ]
        [ [ , ] ROWS_PER_BATCH = rows_per_batch ]
        [ [ , ] ROWTERMINATOR = 'row_terminator' ]
        [ [ , ] TABLOCK ]
        [ [ , ] ERRORFILE = 'file_name' ]
      ) ]
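
Tying this back to the question's setup, a sketch of that workflow driven from Python could look like the following. The file path, schema name, and terminators are placeholders; note that BULK INSERT opens the file server-side, so the path must be readable by the SQL Server machine itself, not just by the client running pandas.

from sqlalchemy import text

csv_path = r'C:\temp\mydatadump.csv'  # hypothetical path visible to the SQL Server instance

# dump the frame without the index and without a header row
my_data_frame.to_csv(csv_path, index=False, header=False)

# load it server-side in one bulk operation
with engine.begin() as conn:
    conn.execute(text(
        "BULK INSERT mydatabase.myschema.mytable "
        "FROM '{}' "
        "WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\\n', TABLOCK)".format(csv_path)
    ))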
answered Sep 19 '22 by firelynx