The problem: I'm trying to upload data to SQL Server and getting speeds of 122 rows per second (17 columns). I decided to post the problem here along with the workaround, in the hope that someone knows the definitive answer.
The most relevant thread I found was "pyodbc - very slow bulk insert speed", but the problem there differs significantly and that thread still has no answer.
It's a simple scenario in which I try to upload a CSV of 350K rows into a blank SQL Server table using Python. I tried one of the most popular approaches: read it into a pandas DataFrame, create a SQLAlchemy engine with fast_executemany=True and use the to_sql() method to store it in the database. I got 122 rows/second, which is unacceptable.
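For reference, a minimal sketch of that approach; the connection string, file name and table name are placeholders, not the actual ones from my setup:

```python
import pandas as pd
from sqlalchemy import create_engine

# fast_executemany=True enables pyodbc's bulk parameter binding
engine = create_engine(
    "mssql+pyodbc://user:password@my_server/my_db"
    "?driver=ODBC+Driver+17+for+SQL+Server",
    fast_executemany=True,
)

df = pd.read_csv("data.csv")  # ~350K rows, 17 columns
df.to_sql("my_table", engine, if_exists="append", index=False)
```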
As mentioned in other threads, this doesn't happen with PostgreSQL or Oracle, and I can add that it doesn't happen with MariaDB either. So I tried a different approach, using pyodbc's cursor.executemany() directly, to see if the bug was in pandas or SQLAlchemy. Same speed.
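The raw pyodbc version looked roughly like this (again, connection string and names are placeholders):

```python
import csv
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=my_server;DATABASE=my_db;UID=user;PWD=password"
)
cursor = conn.cursor()
cursor.fast_executemany = True

# Load the CSV, mapping empty strings to None (NULL)
with open("data.csv", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)
    rows = [tuple(v if v != "" else None for v in r) for r in reader]

placeholders = ", ".join("?" * len(header))
cursor.executemany(f"INSERT INTO my_table VALUES ({placeholders})", rows)
conn.commit()
```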
The next step was to generate synthetic data to replicate the problem and submit a bug report... and to my surprise the generated data inserted at around 8000 records/second. WTF? The synthetic data used the same data types (obviously) as the data in the CSV.
After weeks of trying different things, I decided to look into pyodbc itself. On the pyodbc GitHub wiki I found an interesting piece of information at https://github.com/mkleehammer/pyodbc/wiki/Binding-Parameters, particularly in the Writing NULL and the Solutions and Workarounds sections.
Indeed, 3 of the 17 fields on the first line of the CSV were converted to NaN by pandas or to None manually by me. To my surprise, replacing these None/NaN/NULL values with valid values on the FIRST LINE ONLY boosted the speed to 7000-8000 records/s. Note that I didn't change any of the None/NaN values in the subsequent lines, only on the first one.
Does anyone understand why this happens? Is there a more elegant fix than replacing None/NaN with a valid value?
UPDATE: It seems there are a couple of related issues on the pyodbc GitHub page, all pointing to this same problem. For reference: https://github.com/mkleehammer/pyodbc/issues/213. The thread is relatively old, from 2017, but it seems the problem with how None/NaN is handled still persists.
There's a bug in pyodbc, at least up to version 4.0.30, when talking to Microsoft SQL Server. In summary, SQL Server uses different types of NULL for different field types, and pyodbc can't infer which NULL to use from a None alone. To overcome this limitation, pyodbc implements two approaches:
1. By default, when a None is found on the first line, the parameter is bound to BINARY. Every time a different type is found for the same field, pyodbc re-detects the type and tries to re-bind, but it does that for every subsequent line after the first bind, causing the drop in performance.
2. Passing the field types to the pyodbc cursor with the .setinputsizes() method should avoid this problem entirely, but right now .setinputsizes() is ignored when a None is found in the first line (see the sketch below).
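To illustrate the second approach, a hypothetical call for a three-column table (the column types here are invented for the example; pyodbc expects one (type, size, decimal_digits) tuple per parameter):

```python
import pyodbc

# Declare the parameter types up front so pyodbc doesn't have to
# guess them from the first row of data.
cursor.setinputsizes([
    (pyodbc.SQL_INTEGER, 0, 0),
    (pyodbc.SQL_VARCHAR, 255, 0),
    (pyodbc.SQL_TYPE_TIMESTAMP, 0, 0),
])
cursor.executemany("INSERT INTO my_table VALUES (?, ?, ?)", rows)
```

Because of the bug described above, this declaration is currently discarded when the first row contains a None for that column.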
The pyodbc team is aware of the issue and will be working on a fix in future versions. More information on this bug: https://github.com/mkleehammer/pyodbc/issues/741
Currently, the only effective workaround is to insert a dummy record as the first row (to be removed after the insertion is completed) with a representative value for each column type, so that pyodbc can bind the right types.
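A sketch of that workaround, assuming a DataFrame df, the engine from the first snippet, and an id column that lets us identify and delete the dummy record afterwards (column names and values are illustrative only):

```python
import pandas as pd
from sqlalchemy import text

# Dummy first row with non-NULL, type-representative values
dummy = pd.DataFrame([{
    "id": -1,
    "name": "placeholder",
    "amount": 0.0,
    "created_at": pd.Timestamp("1900-01-01"),
}])

# Prepend the dummy row so pyodbc binds the correct types
pd.concat([dummy, df], ignore_index=True).to_sql(
    "my_table", engine, if_exists="append", index=False
)

# Remove the dummy record once the bulk insert has completed
with engine.begin() as conn:
    conn.execute(text("DELETE FROM my_table WHERE id = -1"))
```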
This problem affects all packages that use pyodbc, including SQLAlchemy and, indirectly, pandas.