
Python ETL - Batch or Iteratively load Large datasets into Oracle Database with cx_Oracle

Using Python to load a dataset of 10MM records into an Oracle Database table. The dataframe is created with no issue, but when loading it, cx_Oracle raises an error because the record count is too large.

Seeking to loop over the dataframe and batch load the 10MM records by inserting them 100k records at a time.

The code shown below works, but only for small datasets that fit in allocated memory. I need a version that works in batches on large datasets.

Have tried iterating over rows, but this takes very long. Have also tried loading a much smaller dataframe; this works but does not achieve the goal.

Also tried setting bindarraysize and arraysize before loading the dataframe, but neither worked.

import pandas as pd
import datetime
import sys
import re
from itertools import groupby, islice, takewhile
import cx_Oracle

format = '%y_%m_%d'

TODAY = datetime.date.today()
add = datetime.timedelta(days=1)
yesterday = datetime.date.today() - add
dd = datetime.date.strftime(TODAY,format)

# connection variables
connection = cx_Oracle.connect("user/Oracle_database_connect_info")
cur = connection.cursor()

# dataframe headers
columns = ['C1','C2','C3','C4']

# -- >> test and sample the file
csv_df = pd.read_csv(r'csvfile_location')

# add record_id for values
csv_df_idx = csv_df.index.values +1
csv_df.insert(0,column = 'RECORD_ID' , value=csv_df_idx)


### TABLE ALREADY CREATED IN DATABASE ###


### Convert to a list of tuples for easy load into the DB
csv_df_dataset_lst = csv_df.values.tolist()

insert_statement = """
INSERT INTO TEST_LOAD
    (RECORD_ID, C1, C2, C3, C4) VALUES (:1, :2, :3, :4, :5)"""

# control number of records to bind for insert
# cur.bindarraysize = 100000 # --->>> did not work
# cur.arraysize = 100000 # --->>> did not work
cur.executemany(insert_statement, csv_df_dataset_lst)
connection.commit()
connection.close()

1 Answer

Figured it out. The trick was to write a function that chunks the dataframe into segments based on the batch size you want to load.
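In essence, each executemany() call should bind only one batch of rows instead of the whole dataset. A minimal sketch of the pattern (TEST_LOAD and the bind positions follow the question; rows stands for the full list of tuples):

# sketch: insert a large list of row tuples in fixed-size batches
def chunker(seq, size):
    # yield consecutive slices of seq, each at most `size` elements long
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))

rows = csv_df.values.tolist()          # all rows as a list of tuples
for batch in chunker(rows, 100000):    # bind at most 100k rows per round trip
    cur.executemany(insert_statement, batch)
    connection.commit()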

Below is the final code.

import pandas as pd
import numpy as np
import datetime
import sys
import re
from itertools import groupby, islice, takewhile
import cx_Oracle

format = '%y_%m_%d'

TODAY = datetime.date.today()
add = datetime.timedelta(days=1)
yesterday = datetime.date.today() - add
dd = datetime.date.strftime(TODAY,format)

# connection variables
connection = cx_Oracle.connect("user/Oracle_database_connect_info")
cur = connection.cursor()

# dataframe headers
columns = ['C1','C2','C3','C4']

# -- >> test and sample the file
csv_df = pd.read_csv(r'csvfile_location')

# add record_id for values
csv_df_idx = csv_df.index.values +1
csv_df.insert(0,column = 'RECORD_ID' , value=csv_df_idx)


### TABLE ALREADY CREATED IN DATABASE ###


# set batch size, i.e. the record count per executemany() call
batch_size = 100000

# create chunker function to split the dataframe into batches
# Note: the last batch will contain the smallest number of records.
def chunker(seq, size):
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))


insert_statement = """
INSERT INTO TEST_LOAD
    (RECORD_ID, C1, C2, C3, C4) VALUES (:1, :2, :3, :4, :5)"""

# Optional: use cursor.prepare so the Oracle DB avoids parsing the insert statement over and over
try:
    cur.prepare(insert_statement)
except cx_Oracle.DatabaseError as e:
    print('Failed to prepare insert cursor: {}'.format(e))
    sys.exit(1)

for batch_df in chunker(csv_df, batch_size):
    ### Convert the current batch (not the whole dataframe) to a list for easy load into the DB
    batch_lst = batch_df.values.tolist()

    cur.executemany(insert_statement, batch_lst)
    connection.commit()
    # record counter to monitor the loading
    cur.execute("SELECT COUNT(*), SYSDATE FROM TEST_LOAD GROUP BY SYSDATE")
    for row in cur.fetchall():
        print(row)
connection.close()
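One caveat with the code above: it still reads the whole CSV into a single dataframe, so the raw file has to fit in memory. If it does not, pandas can stream the file itself: read_csv(..., chunksize=N) returns an iterator of dataframes of at most N rows each, and every chunk can be bound and committed the same way. A sketch under that assumption, reusing the path, table, and bind order from above:

import pandas as pd
import cx_Oracle

connection = cx_Oracle.connect("user/Oracle_database_connect_info")
cur = connection.cursor()

insert_statement = """
INSERT INTO TEST_LOAD
    (RECORD_ID, C1, C2, C3, C4) VALUES (:1, :2, :3, :4, :5)"""

batch_size = 100000
next_record_id = 1

# chunksize makes read_csv return an iterator of dataframes,
# so at most batch_size rows are held in memory at a time
for chunk in pd.read_csv(r'csvfile_location', chunksize=batch_size):
    # assign sequential RECORD_IDs that continue across chunks
    chunk.insert(0, 'RECORD_ID', list(range(next_record_id, next_record_id + len(chunk))))
    next_record_id += len(chunk)

    cur.executemany(insert_statement, chunk.values.tolist())
    connection.commit()

connection.close()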

