
SQLAlchemy bulk insert is slower than building raw SQL

I'm going through this article on SQLAlchemy bulk insert performance. I tried the approaches from its benchmark, including SQLAlchemy ORM bulk_insert_mappings() and SQLAlchemy Core. Unfortunately, every one of them took about 1 minute to insert 1000 rows, which is horrendously slow. I also tried the approach described here, which requires building one large SQL statement like:

INSERT INTO mytable (col1, col2, col3)
VALUES (1,2,3), (4,5,6) ..... --- up to 1000 of these

The insert using this raw SQL looks something like:

MySession.execute('''
insert into MyTable (e, l, a)
values {}
'''.format(",".join(my_insert_str)))

Using this approach I improved the performance by more than 50x, to 10000 insertions in 10-11 seconds.

Here is the code for the approach using the built-in library.

class MyClass(Base):
    __tablename__ = "MyTable"
    e = Column(String(256), primary_key=True)
    l = Column(String(6))
    a = Column(String(20), primary_key=True)

    def __repr__(self):
        return self.e + " " + self.a + " " + self.l

.......

        dict_list = []
        for i, row in chunk.iterrows():
            dict_list.append({"e" : row["e"], "l" : l, "a" : a})

        MySession.execute(
            MyClass.__table__.insert(),
            dict_list
        )

Here is how I connect to the database.

    params = urllib.quote_plus("DRIVER={SQL Server Native Client 10.0};SERVER=servername;DATABASE=dbname;UID=user;PWD=pass")
    engine = create_engine("mssql+pyodbc:///?odbc_connect=%s" % params )
    MySession.configure(bind=engine, autoflush=False, expire_on_commit=False)

Is there an issue with my setup that degrades the performance this much? I tried different DB drivers, pyodbc and pymssql. Whatever I try, I cannot get anywhere close to the numbers claimed in the article, namely:

SQLAlchemy ORM: Total time for 100000 records 2.192882061 secs
SQLAlchemy ORM pk given: Total time for 100000 records 1.41679310799 secs
SQLAlchemy ORM bulk_save_objects(): Total time for 100000 records 0.494568824768 secs
SQLAlchemy ORM bulk_insert_mappings(): Total time for 100000 records 0.325763940811 secs
SQLAlchemy Core: Total time for 100000 records 0.239127874374 secs
sqlite3: Total time for 100000 records 0.124729156494 sec

I'm connecting to MS SQL Server 2008. Let me know if I've missed any other details.

The problem with the raw SQL approach is that it is not safe against SQL injection. So if you have suggestions for solving that issue instead, they would also be very helpful :).

Anton Belev asked Dec 19 '22 05:12


1 Answer

You're doing

MySession.execute(
    MyClass.__table__.insert(),
    dict_list
)

which uses executemany(): the single-row INSERT is run once per parameter set. That is not the same as one multi-row INSERT INTO ... VALUES (...), (...) statement. To use multi-row VALUES, do:

MySession.execute(
    MyClass.__table__.insert().values(dict_list)
)

As a side note, the SQL injection problem is solved using parameters:

MySession.execute('''
insert into MyTable (e, l, a)
values (?, ?, ?), (?, ?, ?), ...
''', params)
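
The parameterized multi-row form can be sketched end to end. This is only an illustration: it uses Python's built-in sqlite3 module instead of SQL Server so that it runs anywhere, and the helper name insert_multirow, its max_params and max_rows defaults, and the sample data are assumptions made for the example, not part of SQLAlchemy or pyodbc. It builds one (?, ?, ?) group per row, passes the values separately, and splits the rows into chunks, since SQL Server rejects statements with more than 2100 parameters or more than 1000 rows in one VALUES list.

```python
import sqlite3


def insert_multirow(conn, rows, max_params=2000, max_rows=1000):
    """Insert (e, l, a) tuples using multi-row VALUES with bound parameters.

    Hypothetical helper. max_params and max_rows are assumed limits chosen to
    stay under SQL Server's 2100-parameter and 1000-row caps per statement.
    Returns the number of INSERT statements that were executed.
    """
    if not rows:
        return 0
    cols_per_row = len(rows[0])
    rows_per_stmt = max(1, min(max_rows, max_params // cols_per_row))
    group = "(" + ", ".join("?" * cols_per_row) + ")"
    statements = 0
    for start in range(0, len(rows), rows_per_stmt):
        chunk = rows[start:start + rows_per_stmt]
        # Only placeholders are formatted into the SQL text, never the values.
        placeholders = ", ".join([group] * len(chunk))
        flat_values = [value for row in chunk for value in row]
        conn.execute(
            "INSERT INTO MyTable (e, l, a) VALUES " + placeholders,
            flat_values,
        )
        statements += 1
    return statements


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE MyTable (e TEXT, l TEXT, a TEXT, PRIMARY KEY (e, a))"
)

# The middle value would break string-formatted SQL; as a parameter it is data.
hostile = "x'); DROP TABLE MyTable; --"
rows = [("e%d" % i, hostile, "a%d" % i) for i in range(1000)]

# A small max_params keeps the demo within older SQLite variable limits.
statement_count = insert_multirow(conn, rows, max_params=300)
conn.commit()

row_count = conn.execute("SELECT COUNT(*) FROM MyTable").fetchone()[0]
stored = conn.execute("SELECT l FROM MyTable LIMIT 1").fetchone()[0]
print(statement_count, row_count)  # prints: 10 1000
```

With three columns and max_params=300, each statement carries 100 rows, so 1000 rows go out in 10 statements instead of 1000 single-row executions, and the hostile string is stored as plain text.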

The takeaway here is that you're not comparing equivalent constructs. You're not using VALUES in the SQLAlchemy-generated query but you are in your textual SQL, and you're not using parameterization in your textual SQL but you are in the SQLAlchemy-generated query. If you turn on logging for the executed SQL statements you'll see exactly what is different.
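
One way to turn that logging on, as a minimal sketch: SQLAlchemy reports executed statements on the standard "sqlalchemy.engine" logger, so the standard library's logging module is enough. Passing echo=True to create_engine() is the documented shortcut for the same thing. The log format string below is an arbitrary choice for the example.

```python
import logging

# Send log records to stderr; without a handler nothing is displayed.
logging.basicConfig(format="%(asctime)s %(name)s %(message)s")

# INFO logs each SQL statement with its parameters;
# DEBUG additionally logs result rows.
engine_logger = logging.getLogger("sqlalchemy.engine")
engine_logger.setLevel(logging.INFO)
```

Run this before executing the inserts. The executemany() call should then show a single-row INSERT with a list of parameter sets, while the .values(dict_list) version should show one statement with a VALUES group per row.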

univerio answered Dec 22 '22 00:12