In Python, I have a process to select data from one database (Redshift, via psycopg2), then insert that data into SQL Server (via pyodbc). I chose to do a read / write rather than a read / flat file / load because the row count is around 100,000 per day; it seemed easier to simply connect and insert. However, the insert process is slow, taking several minutes.

Is there a better way to insert data into SQL Server with pyodbc?
select_cursor.execute(output_query)

done = False
rowcount = 0

while not done:
    rows = select_cursor.fetchmany(10000)
    insert_list = []

    if rows == []:
        done = True
        break

    for row in rows:
        rowcount += 1
        insert_params = (
            row[0],
            row[1],
            row[2]
        )
        insert_list.append(insert_params)

    insert_cnxn = pyodbc.connect('''Connection Information''')
    insert_cursor = insert_cnxn.cursor()

    insert_cursor.executemany("""
        INSERT INTO Destination (AccountNumber, OrderDate, Value)
        VALUES (?, ?, ?)
        """, insert_list)

    insert_cursor.commit()
    insert_cursor.close()
    insert_cnxn.close()

select_cursor.close()
select_cnxn.close()
I know that an INSERT on a SQL table can be slow for any number of reasons: the existence of INSERT triggers on the table, lots of enforced constraints that have to be checked (usually foreign keys), and page splits in the clustered index when a row is inserted in the middle of the table.
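If you want to rule those factors out for the destination table, one quick diagnostic is to query SQL Server's catalog views over the same pyodbc connection. This is only a sketch: it reuses the question's Destination table and its '''Connection Information''' placeholder, and assumes the table lives in the dbo schema.

import pyodbc

cnxn = pyodbc.connect('''Connection Information''')   # placeholder, as in the question
cursor = cnxn.cursor()

# Triggers defined on the destination table
cursor.execute("SELECT name FROM sys.triggers "
               "WHERE parent_id = OBJECT_ID('dbo.Destination')")
print("Triggers:", [row.name for row in cursor.fetchall()])

# Foreign key constraints that must be checked on every insert
cursor.execute("SELECT name FROM sys.foreign_keys "
               "WHERE parent_object_id = OBJECT_ID('dbo.Destination')")
print("Foreign keys:", [row.name for row in cursor.fetchall()])

cnxn.close()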
Both the 'Bulk insert with batch size' and 'Use single record insert' options are used for inserting records into a database table. The 'Bulk insert with batch size' option is used when you want the whole dataset to be loaded in batches of a specified size. Typically, larger batch sizes result in better transfer speeds.
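For example, here is a minimal sketch of batched inserts with pyodbc, reusing the cursors, connection, and table from the question but opening the insert connection once outside the loop; the batch size of 5,000 is only an assumed starting point to tune.

batch_size = 5000   # assumed starting point; tune for your environment
insert_sql = "INSERT INTO Destination (AccountNumber, OrderDate, Value) VALUES (?, ?, ?)"

while True:
    rows = select_cursor.fetchmany(batch_size)
    if not rows:
        break
    insert_cursor.executemany(insert_sql, [(r[0], r[1], r[2]) for r in rows])
    insert_cnxn.commit()   # one commit per batch rather than per row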
The high-level process for using BULK INSERT from a Python program: assemble the CREATE TABLE command for the table into which the data will be imported, execute the CREATE TABLE command from within your Python program using a cursor, then assemble the BULK INSERT command for the file to be imported and execute it the same way.
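A rough sketch of that route with pyodbc follows. The CSV path, the terminator options, and the assumption that the daily extract has already been written to a file readable by the SQL Server instance are all illustrative and not part of the original question; the CREATE TABLE step is skipped here because Destination already exists.

import pyodbc

cnxn = pyodbc.connect('''Connection Information''')   # placeholder, as in the question
cursor = cnxn.cursor()

# Hypothetical file path; SQL Server reads it server-side, so the file must be
# visible to (and readable by) the SQL Server service account.
cursor.execute(r"""
    BULK INSERT Destination
    FROM 'C:\data\daily_extract.csv'
    WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2)
""")
cnxn.commit()
cnxn.close()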
UPDATE: pyodbc 4.0.19 added a Cursor#fast_executemany option that can greatly improve performance by avoiding the behaviour described below. See this answer for details.
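A minimal sketch, assuming pyodbc 4.0.19 or later and reusing the insert connection and parameter list from the question; enabling the option is a one-line change before the .executemany call.

insert_cursor = insert_cnxn.cursor()
insert_cursor.fast_executemany = True   # available from pyodbc 4.0.19 onward

insert_cursor.executemany(
    "INSERT INTO Destination (AccountNumber, OrderDate, Value) VALUES (?, ?, ?)",
    insert_list)
insert_cnxn.commit()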
Your code does follow proper form (aside from the few minor tweaks mentioned in the other answer), but be aware that when pyodbc performs an .executemany, what it actually does is submit a separate sp_prepexec for each individual row. That is, for the code
sql = "INSERT INTO #Temp (id, txtcol) VALUES (?, ?)"
params = [(1, 'foo'), (2, 'bar'), (3, 'baz')]
crsr.executemany(sql, params)
SQL Server actually performs the following (as confirmed by SQL Profiler):
exec sp_prepexec @p1 output,N'@P1 bigint,@P2 nvarchar(3)',N'INSERT INTO #Temp (id, txtcol) VALUES (@P1, @P2)',1,N'foo'
exec sp_prepexec @p1 output,N'@P1 bigint,@P2 nvarchar(3)',N'INSERT INTO #Temp (id, txtcol) VALUES (@P1, @P2)',2,N'bar'
exec sp_prepexec @p1 output,N'@P1 bigint,@P2 nvarchar(3)',N'INSERT INTO #Temp (id, txtcol) VALUES (@P1, @P2)',3,N'baz'
So, for an .executemany "batch" of 10,000 rows you would be performing 10,000 individual inserts, with 10,000 round trips to the server, sending the identical SQL command text (INSERT INTO ...) 10,000 times.

It is possible to have pyodbc send an initial sp_prepare and then do an .executemany calling sp_execute, but the nature of .executemany is that you would still make 10,000 round trips to the server, just executing sp_execute instead of INSERT INTO .... That could improve performance if the SQL statement were quite long and complex, but for a short one like the example in your question it probably wouldn't make all that much difference.
One could also get creative and build "table value constructors" as illustrated in this answer, but notice that it is only offered as a "Plan B" when native bulk insert mechanisms are not a feasible solution.
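As a rough illustration of that idea (not the linked answer's exact code), rows can be packed into multi-row VALUES clauses. The chunk size of 500 is chosen so that 500 rows times 3 parameters stays under SQL Server's 2,100-parameter limit per statement, and a table value constructor is in any case capped at 1,000 rows.

def insert_with_tvc(cursor, rows, chunk_size=500):
    """Insert rows in chunks using a multi-row VALUES clause (table value constructor)."""
    for start in range(0, len(rows), chunk_size):
        chunk = rows[start:start + chunk_size]
        placeholders = ", ".join("(?, ?, ?)" for _ in chunk)
        sql = ("INSERT INTO Destination (AccountNumber, OrderDate, Value) "
               "VALUES " + placeholders)
        # Flatten the chunk's tuples into a single parameter list
        cursor.execute(sql, [value for row in chunk for value in row])

insert_with_tvc(insert_cursor, insert_list)
insert_cnxn.commit()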