Using the pandas DataFrame's to_sql method, I can write a small number of rows to a table in an Oracle database pretty easily:
from sqlalchemy import create_engine
import cx_Oracle
dsn_tns = "(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=<host>)(PORT=1521))\
(CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=<servicename>)))"
pwd = input('Please type in password:')
engine = create_engine('oracle+cx_oracle://myusername:' + pwd + '@%s' % dsn_tns)
df.to_sql('test_table', engine.connect(), if_exists='replace')
But with any regular-sized dataframe (mine has 60k rows, which is not that big), the code becomes unusable: it never finished in the time I was willing to wait (definitely more than 10 minutes). I've googled and searched quite a few times, and the closest solution was the answer given by ansonw in this question. But that one was about MySQL, not Oracle. As Ziggy Eunicien pointed out, it did not work for Oracle. Any ideas?
EDIT
Here's a sample of rows in the dataframe:
id name premium created_date init_p term_number uprate value score group action_reason
160442353 LDP: Review 1295.619617 2014-01-20 1130.75 1 7 -42 236.328243 6 pass
164623435 TRU: Referral 453.224880 2014-05-20 0.00 11 NaN -55 38.783290 1 suppress
and here are the data types of the df:
id int64
name object
premium float64
created_date object
init_p float64
term_number float64
uprate float64
value float64
score float64
group int64
action_reason object
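For reference, here is a small sketch that rebuilds this sample as a dataframe from the two rows above (values copied from the sample, dtypes as listed; purely illustrative):
import pandas as pd
import numpy as np

# rebuild the two sample rows with the dtypes listed above
df = pd.DataFrame({
    'id': [160442353, 164623435],
    'name': ['LDP: Review', 'TRU: Referral'],
    'premium': [1295.619617, 453.224880],
    'created_date': ['2014-01-20', '2014-05-20'],   # object (string) column
    'init_p': [1130.75, 0.00],
    'term_number': [1.0, 11.0],
    'uprate': [7.0, np.nan],
    'value': [-42.0, -55.0],
    'score': [236.328243, 38.783290],
    'group': [6, 1],
    'action_reason': ['pass', 'suppress'],
})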
Pandas + SQLAlchemy by default save all object (string) columns as CLOB in Oracle DB, which makes insertion extremely slow.
Here are some tests:
import pandas as pd
import cx_Oracle
from sqlalchemy import types, create_engine
#######################################################
### DB connection strings config
#######################################################
tns = """
(DESCRIPTION =
(ADDRESS = (PROTOCOL = TCP)(HOST = my-db-scan)(PORT = 1521))
(CONNECT_DATA =
(SERVER = DEDICATED)
(SERVICE_NAME = my_service_name)
)
)
"""
usr = "test"
pwd = "my_oracle_password"
engine = create_engine('oracle+cx_oracle://%s:%s@%s' % (usr, pwd, tns))
# sample DF [shape: `(2000, 11)`]
# I took your 2-row DF and replicated it: `df = pd.concat([df] * 10**3, ignore_index=True)`
df = pd.read_csv('/path/to/file.csv')
DF info:
In [61]: df.shape
Out[61]: (2000, 11)
In [62]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 11 columns):
id 2000 non-null int64
name 2000 non-null object
premium 2000 non-null float64
created_date 2000 non-null datetime64[ns]
init_p 2000 non-null float64
term_number 2000 non-null int64
uprate 1000 non-null float64
value 2000 non-null int64
score 2000 non-null float64
group 2000 non-null int64
action_reason 2000 non-null object
dtypes: datetime64[ns](1), float64(4), int64(4), object(2)
memory usage: 172.0+ KB
Let's check how long it will take to store it in the Oracle DB:
In [57]: df.shape
Out[57]: (2000, 11)
In [58]: %timeit -n 1 -r 1 df.to_sql('test_table', engine, index=False, if_exists='replace')
1 loop, best of 1: 16 s per loop
In Oracle DB (pay attention to the CLOBs):
AAA> desc test.test_table
Name Null? Type
------------------------------- -------- ------------------
ID NUMBER(19)
NAME CLOB # !!!
PREMIUM FLOAT(126)
CREATED_DATE DATE
INIT_P FLOAT(126)
TERM_NUMBER NUMBER(19)
UPRATE FLOAT(126)
VALUE NUMBER(19)
SCORE FLOAT(126)
group NUMBER(19)
ACTION_REASON CLOB # !!!
Now let's instruct pandas to save all object columns as VARCHAR data types:
In [59]: dtyp = {c:types.VARCHAR(df[c].str.len().max())
...: for c in df.columns[df.dtypes == 'object'].tolist()}
...:
In [60]: %timeit -n 1 -r 1 df.to_sql('test_table', engine, index=False, if_exists='replace', dtype=dtyp)
1 loop, best of 1: 335 ms per loop
This time it was approximately 48 times faster.
Check in Oracle DB:
AAA> desc test.test_table
Name Null? Type
----------------------------- -------- ---------------------
ID NUMBER(19)
NAME VARCHAR2(13 CHAR) # !!!
PREMIUM FLOAT(126)
CREATED_DATE DATE
INIT_P FLOAT(126)
TERM_NUMBER NUMBER(19)
UPRATE FLOAT(126)
VALUE NUMBER(19)
SCORE FLOAT(126)
group NUMBER(19)
ACTION_REASON VARCHAR2(8 CHAR) # !!!
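For this sample data, the dtyp mapping built above evaluates to roughly the following (a sketch; the lengths match the VARCHAR2(13 CHAR) and VARCHAR2(8 CHAR) columns in the desc output):
from sqlalchemy import types

# longest 'name' value is 'TRU: Referral' (13 chars),
# longest 'action_reason' value is 'suppress' (8 chars)
dtyp = {
    'name': types.VARCHAR(13),
    'action_reason': types.VARCHAR(8),
}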
Let's test it with a 200,000-row DF:
In [69]: df.shape
Out[69]: (200000, 11)
In [70]: %timeit -n 1 -r 1 df.to_sql('test_table', engine, index=False, if_exists='replace', dtype=dtyp, chunksize=10**4)
1 loop, best of 1: 4.68 s per loop
It took ~5 seconds for the 200K-row DF in my test (not the fastest) environment.
Conclusion: use the following trick to explicitly specify a dtype for all DF columns of object dtype when saving DataFrames to Oracle DB. Otherwise they will be saved as the CLOB data type, which requires special treatment and makes insertion very slow:
dtyp = {c: types.VARCHAR(df[c].str.len().max())
        for c in df.columns[df.dtypes == 'object'].tolist()}
df.to_sql(..., dtype=dtyp)
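Putting it all together, a minimal end-to-end sketch (the connection string and the CSV path are placeholders, not real values):
import pandas as pd
from sqlalchemy import types, create_engine

# placeholder connection string - replace with your own user/password/DSN
engine = create_engine('oracle+cx_oracle://user:password@my_dsn')

# placeholder source file
df = pd.read_csv('/path/to/file.csv')

# map every object (string) column to a VARCHAR sized to its longest value,
# so Oracle creates VARCHAR2 columns instead of CLOBs
dtyp = {c: types.VARCHAR(df[c].str.len().max())
        for c in df.columns[df.dtypes == 'object'].tolist()}

df.to_sql('test_table', engine, index=False, if_exists='replace',
          dtype=dtyp, chunksize=10**4)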
You can just use method='multi', and this will boost your data insertion speed. You can also adjust the chunksize to your needs, depending on your data.
I found this when I tried to write a Google Cloud Function that loads data from CSV/Excel files into a dataframe and saves that dataframe to a PostgreSQL database in Google Cloud SQL.
This is a handy tool to use if you can give the dataframe the same structure as your database table.
df.to_sql(
'table_name',
con=engine,
if_exists='append',
index=False,
chunksize=2000,
method='multi'
)
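For context, a minimal sketch of the flow described above; the connection string, file path and table name are assumptions, not the exact Cloud SQL setup:
import pandas as pd
from sqlalchemy import create_engine

# hypothetical PostgreSQL connection string (e.g. a Cloud SQL instance)
engine = create_engine('postgresql+psycopg2://user:password@host:5432/dbname')

# load the CSV/Excel data into a dataframe whose columns match the target table
df = pd.read_csv('/path/to/file.csv')

df.to_sql(
    'table_name',
    con=engine,
    if_exists='append',
    index=False,
    chunksize=2000,   # tune to your data volume
    method='multi'    # pass multiple rows per INSERT statement
)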