
Speed up to_sql() when writing Pandas DataFrame to Oracle database using SqlAlchemy and cx_Oracle

Using the pandas DataFrame to_sql method, I can write a small number of rows to a table in an Oracle database pretty easily:

from sqlalchemy import create_engine
import cx_Oracle
dsn_tns = "(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=<host>)(PORT=1521))\
       (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=<servicename>)))"
pwd = input('Please type in password:')
engine = create_engine('oracle+cx_oracle://myusername:' + pwd + '@%s' % dsn_tns)
df.to_sql('test_table', engine.connect(), if_exists='replace')

But with any regular-sized DataFrame (mine has 60k rows, which is not that big), the code becomes unusable: it never finished in the time I was willing to wait (definitely more than 10 min). I've searched quite a few times, and the closest solution was the answer given by ansonw in this question. But that one was about MySQL, not Oracle, and as Ziggy Eunicien pointed out, it did not work for Oracle. Any ideas?

EDIT

Here's a sample of rows in the dataframe:

id         name           premium      created_date  init_p   term_number  uprate  value  score       group  action_reason
160442353  LDP: Review    1295.619617  2014-01-20    1130.75  1            7       -42    236.328243  6      pass
164623435  TRU: Referral  453.224880   2014-05-20    0.00     11           NaN     -55    38.783290   1      suppress

and here are the data types for the df:

id               int64
name             object
premium          float64
created_date     object
init_p           float64
term_number      float64
uprate           float64
value            float64
score            float64
group            int64
action_reason    object

breezymri asked Mar 10 '17 21:03


2 Answers

By default, pandas + SQLAlchemy save all object (string) columns as CLOB in an Oracle DB, which makes insertion extremely slow.
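
The CLOB mapping can be seen without a database connection. The following is a minimal sketch, assuming a reasonably recent SQLAlchemy: it renders CREATE TABLE statements in the Oracle dialect for a generic Text column (the type pandas falls back to for object columns) and for an explicit VARCHAR. The helper name oracle_ddl and the one-column table are illustrative only, and the exact DDL text may vary between SQLAlchemy versions.

```python
from sqlalchemy import Column, MetaData, Table, types
from sqlalchemy.dialects import oracle
from sqlalchemy.schema import CreateTable


def oracle_ddl(name_type):
    """Render CREATE TABLE for a one-column table using the Oracle dialect.

    Hypothetical helper for illustration; nothing is sent to a database.
    """
    table = Table('test_table', MetaData(), Column('name', name_type))
    return str(CreateTable(table).compile(dialect=oracle.dialect()))


# Generic text type: what an object column becomes when no dtype is given.
ddl_default = oracle_ddl(types.Text())
# Explicit, bounded string type: what the dtype= argument can request instead.
ddl_varchar = oracle_ddl(types.VARCHAR(13))

print(ddl_default)
print(ddl_varchar)
```

The first statement should declare the column as CLOB, the second as a bounded VARCHAR type.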

Here are some tests:

import pandas as pd
import cx_Oracle
from sqlalchemy import types, create_engine

#######################################################
### DB connection strings config
#######################################################
tns = """
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = my-db-scan)(PORT = 1521))
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = my_service_name)
    )
  )
"""

usr = "test"
pwd = "my_oracle_password"

engine = create_engine('oracle+cx_oracle://%s:%s@%s' % (usr, pwd, tns))

# sample DF [shape: `(2000, 11)`]
# i took your 2 rows DF and replicated it: `df = pd.concat([df]* 10**3, ignore_index=True)`
df = pd.read_csv('/path/to/file.csv')

DF info:

In [61]: df.shape
Out[61]: (2000, 11)

In [62]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 11 columns):
id               2000 non-null int64
name             2000 non-null object
premium          2000 non-null float64
created_date     2000 non-null datetime64[ns]
init_p           2000 non-null float64
term_number      2000 non-null int64
uprate           1000 non-null float64
value            2000 non-null int64
score            2000 non-null float64
group            2000 non-null int64
action_reason    2000 non-null object
dtypes: datetime64[ns](1), float64(4), int64(4), object(2)
memory usage: 172.0+ KB
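
To reproduce this test frame without the CSV file, here is a minimal sketch that rebuilds it from the two sample rows in the question and replicates them 1000 times, as described in the comment above. The names rows and columns are illustrative, and it assumes created_date was parsed to a datetime, as the info() output suggests.

```python
import numpy as np
import pandas as pd

# The two sample rows shown in the question.
rows = [
    (160442353, 'LDP: Review', 1295.619617, '2014-01-20', 1130.75,
     1, 7, -42, 236.328243, 6, 'pass'),
    (164623435, 'TRU: Referral', 453.224880, '2014-05-20', 0.00,
     11, np.nan, -55, 38.783290, 1, 'suppress'),
]
columns = ['id', 'name', 'premium', 'created_date', 'init_p', 'term_number',
           'uprate', 'value', 'score', 'group', 'action_reason']

df = pd.DataFrame(rows, columns=columns)
# Parse the date strings so the column is a datetime rather than object.
df['created_date'] = pd.to_datetime(df['created_date'])
# Replicate the two rows 1000 times to get a 2000-row frame.
df = pd.concat([df] * 10**3, ignore_index=True)

print(df.shape)
```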

Let's check how long it takes to store it in the Oracle DB:

In [57]: df.shape
Out[57]: (2000, 11)

In [58]: %timeit -n 1 -r 1 df.to_sql('test_table', engine, index=False, if_exists='replace')
1 loop, best of 1: 16 s per loop

In the Oracle DB (note the CLOB columns):

AAA> desc test.test_table
 Name                            Null?    Type
 ------------------------------- -------- ------------------
 ID                                       NUMBER(19)
 NAME                                     CLOB        #  !!!
 PREMIUM                                  FLOAT(126)
 CREATED_DATE                             DATE
 INIT_P                                   FLOAT(126)
 TERM_NUMBER                              NUMBER(19)
 UPRATE                                   FLOAT(126)
 VALUE                                    NUMBER(19)
 SCORE                                    FLOAT(126)
 group                                    NUMBER(19)
 ACTION_REASON                            CLOB        #  !!!

Now let's instruct pandas to save all object columns as VARCHAR data types:

In [59]: dtyp = {c:types.VARCHAR(df[c].str.len().max())
    ...:         for c in df.columns[df.dtypes == 'object'].tolist()}
    ...:

In [60]: %timeit -n 1 -r 1 df.to_sql('test_table', engine, index=False, if_exists='replace', dtype=dtyp)
1 loop, best of 1: 335 ms per loop

This time it was approx. 48 times faster (16 s vs. 335 ms).

Check in Oracle DB:

 AAA> desc test.test_table
 Name                          Null?    Type
 ----------------------------- -------- ---------------------
 ID                                     NUMBER(19)
 NAME                                   VARCHAR2(13 CHAR)        #  !!!
 PREMIUM                                FLOAT(126)
 CREATED_DATE                           DATE
 INIT_P                                 FLOAT(126)
 TERM_NUMBER                            NUMBER(19)
 UPRATE                                 FLOAT(126)
 VALUE                                  NUMBER(19)
 SCORE                                  FLOAT(126)
 group                                  NUMBER(19)
 ACTION_REASON                          VARCHAR2(8 CHAR)        #  !!!

Let's test it with a 200,000-row DF:

In [69]: df.shape
Out[69]: (200000, 11)

In [70]: %timeit -n 1 -r 1 df.to_sql('test_table', engine, index=False, if_exists='replace', dtype=dtyp, chunksize=10**4)
1 loop, best of 1: 4.68 s per loop

It took ~5 seconds for a 200K-row DF in my (not the fastest) test environment.
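
For reference, the arithmetic behind these comparisons can be sketched as follows. It uses only the timings reported above; the variable names are illustrative, and since each timing comes from a single run (-n 1 -r 1), the results are rough indications rather than benchmarks.

```python
# Row counts and wall-clock timings (seconds) reported in the tests above.
clob_rows, clob_secs = 2_000, 16.0        # default dtype (CLOB columns)
varchar_rows, varchar_secs = 2_000, 0.335  # explicit VARCHAR dtype
big_rows, big_secs = 200_000, 4.68        # VARCHAR dtype, chunksize=10**4

# Speedup of the VARCHAR run over the CLOB run on the same 2000-row frame.
speedup = clob_secs / varchar_secs

# Approximate throughput of each run in rows per second.
clob_rate = clob_rows / clob_secs
varchar_rate = varchar_rows / varchar_secs
big_rate = big_rows / big_secs

print(f"speedup: {speedup:.1f}x")
print(f"CLOB: {clob_rate:.0f} rows/s")
print(f"VARCHAR: {varchar_rate:.0f} rows/s")
print(f"VARCHAR, 200K rows: {big_rate:.0f} rows/s")
```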

Conclusion: when saving DataFrames to an Oracle DB, use the following trick to explicitly specify the dtype for all DF columns of object dtype. Otherwise they'll be saved as the CLOB data type, which requires special treatment and makes insertion very slow:

dtyp = {c:types.VARCHAR(df[c].str.len().max())
        for c in df.columns[df.dtypes == 'object'].tolist()}

df.to_sql(..., dtype=dtyp)
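
The one-liner above relies on every object column containing only strings with no missing values; with NaN present, the maximum length comes back as a float. Here is a slightly more defensive sketch of the same idea. The function name varchar_dtypes and its default_len parameter are hypothetical, not part of pandas or SQLAlchemy.

```python
import pandas as pd
from sqlalchemy import types


def varchar_dtypes(df, default_len=1):
    """Map each string-like column to a VARCHAR sized to its longest value.

    Hypothetical helper: default_len is used for columns that hold no
    non-null values, since a zero-length VARCHAR is not useful.
    """
    mapping = {}
    for col in df.columns:
        series = df[col]
        # Object columns, plus dedicated string dtypes where pandas uses them.
        if not (pd.api.types.is_object_dtype(series)
                or pd.api.types.is_string_dtype(series)):
            continue
        # Ignore missing values and measure everything else as text.
        lengths = series.dropna().astype(str).str.len()
        max_len = int(lengths.max()) if not lengths.empty else default_len
        mapping[col] = types.VARCHAR(max(max_len, default_len))
    return mapping


sample = pd.DataFrame({
    'id': [160442353, 164623435],
    'name': ['LDP: Review', 'TRU: Referral'],
    'action_reason': ['pass', None],
})
dtyp = varchar_dtypes(sample)
print({col: t.length for col, t in dtyp.items()})
```

The resulting dictionary can be passed as dtype=dtyp in the same way as above. Sizing columns to the current data means later, longer values will not fit, so a fixed upper bound may be the safer choice for tables that are appended to.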

MaxU - stop WAR against UA answered Oct 08 '22 21:10


You can just use method='multi', and this will boost your data insertion speed.

You can also adjust the chunksize to suit your data.

I found this when writing a Google Cloud Function that loads data from CSV/Excel files into a DataFrame and then saves that DataFrame to a PostgreSQL database in Google Cloud SQL.

This is a handy tool to use if you can give the DataFrame the same structure as your database table.

df.to_sql(
    'table_name',
    con=engine, 
    if_exists='append', 
    index=False, 
    chunksize=2000,
    method='multi'
)
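
To try these parameters without a database server, here is a minimal runnable sketch that uses an in-memory SQLite connection as a stand-in. It only shows that the call shape works and that all rows arrive; it says nothing about speed, and it is an assumption, not something tested here, that the same settings behave well on Oracle (the observation above was made on PostgreSQL). The sample data is made up.

```python
import sqlite3

import pandas as pd

# Made-up sample data: 500 rows, 3 columns.
df = pd.DataFrame({
    'id': range(500),
    'name': ['row %d' % i for i in range(500)],
    'score': [i * 0.5 for i in range(500)],
})

# Stand-in database; pandas accepts a plain sqlite3 connection.
conn = sqlite3.connect(':memory:')

# Each chunk of 100 rows is sent as a single multi-row INSERT.
# chunksize * number of columns must stay under the backend's
# bound-parameter limit (100 * 3 = 300 here).
df.to_sql(
    'table_name',
    con=conn,
    if_exists='append',
    index=False,
    chunksize=100,
    method='multi',
)

n_rows = conn.execute('SELECT COUNT(*) FROM table_name').fetchone()[0]
print(n_rows)
conn.close()
```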

Jayesh Manani answered Oct 08 '22 21:10