
How to use Bulk insert to insert data from Dataframe to SQL Server table?

I'm new to Python, so I'm reaching out for help. I have a CSV file in an S3 bucket, and I would like to use Python with pyodbc to import it into a table in SQL Server. The file is 50 MB (about 400k records). My code is below. As it stands, my CSV data ends up in a DataFrame; how can I use BULK INSERT to load that DataFrame into a SQL Server table? If my approach can't work, please advise a different one.

# Connection to S3
s3 = boto3.client(
    service_name = 's3',
    region_name = 'us-gov-west-1',
    aws_access_key_id = 'ZZZZZZZZZZZZZZZZZZ',
    aws_secret_access_key = 'AAAAAAAAAAAAAAAAA')

# Connection to SQL Server
server = 'myserver.amazonaws.com'
path = 'folder1/folder2/folder3/myCSVFile.csv'   # S3 key of the CSV file

cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER='+server+';DATABASE=DB-staging;UID=User132;PWD=XXXXXX')
cursor = cnxn.cursor()

obj_sum = s3.get_object(Bucket = 'my_bucket', Key = path)
csv_data = pd.read_csv(obj_sum['Body'])  # read_csv already returns a DataFrame
df = pd.DataFrame(csv_data, columns = ['SYSTEM_NAME', 'BUCKET_NAME', 'LOCATION', 'FILE_NAME', 'LAST_MOD_DATE', 'FILE_SIZE'])  # select/order the columns of interest
#print(df.head(n=15).to_string(index=False))

# Insert DataFrame to table
cursor.execute("""truncate table dbo.table1""")
cursor.execute("""BULK INSERT dbo.table1 FROM """ + .....# what do I put here since data is in dataframe??)

I tried looping through the DataFrame, and it took 20 minutes to insert 5k records (code below). Inserting row by row works, but it is far too slow, which is why I'm looking for a bulk-insert option.

for i in df.itertuples(index = False):
    if i.FILE_SIZE != 0:
        cursor.execute("""insert into dbo.table1 (SYSTEM_NAME, BUCKET_NAME, X_LOCATION, FILE_NAME, LAST_MOD_DATE, FILE_SIZE)
                          values (?,?,?,?,?,?)""", i.SYSTEM_NAME, i.BUCKET_NAME, i.LOCATION, i.FILE_NAME, i.LAST_MOD_DATE, i.FILE_SIZE)
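
For reference, pyodbc can batch this pattern itself via executemany with cursor.fast_executemany, instead of one round trip per row; it works best with a recent driver such as ODBC Driver 17 for SQL Server. A minimal sketch, assuming the same cnxn, df, and dbo.table1 as above:

# Batch all rows in one executemany call; fast_executemany binds the
# parameters as bulk ODBC arrays instead of row-by-row round trips.
cursor = cnxn.cursor()
cursor.fast_executemany = True
rows = [tuple(r) for r in df[df.FILE_SIZE != 0].itertuples(index=False)]
cursor.executemany("""insert into dbo.table1 (SYSTEM_NAME, BUCKET_NAME, X_LOCATION, FILE_NAME, LAST_MOD_DATE, FILE_SIZE)
                      values (?,?,?,?,?,?)""", rows)
cnxn.commit()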

Lastly, a bonus question: I would like to check whether the FILE_SIZE column equals 0 for a record and, if so, skip it and move on to the next one.
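
A minimal sketch of that check done once, vectorized in pandas (assuming FILE_SIZE is numeric), which is also what the batch sketch above applies inline:

df = df[df['FILE_SIZE'] != 0]   # keep only records with a non-zero FILE_SIZE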

Thank you in advance.


1 Answer

Thanks for the help.

Using fast_executemany = True did the job for me:

import sqlalchemy as sal

engine = sal.create_engine("mssql+pyodbc://username:password@" + server + ":1433/db-name?driver=ODBC+Driver+17+for+SQL+Server&Trusted_Connection=yes",
                           fast_executemany = True)
conn = engine.connect()

I had to change my code around to use sqlalchemy, but it's working great now.

The call to upload the DataFrame to SQL Server looks like this:

df.to_sql('table1', con = engine, schema = 'dbo', index = False, if_exists = 'replace')  # first arg is the target table name
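
One caveat: if_exists = 'replace' drops and recreates the table, discarding the existing schema. To keep the table definition and mirror the truncate-then-load flow from the question, a hedged alternative using the same engine:

# Keep the existing dbo.table1 definition: truncate, then append in batches.
with engine.begin() as conn:
    conn.execute(sal.text("truncate table dbo.table1"))
df.to_sql('table1', con = engine, schema = 'dbo', index = False, if_exists = 'append', chunksize = 1000)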