Why is SQLAlchemy insert with sqlite 25 times slower than using sqlite3 directly?

Tags: python, sqlite, orm, sqlalchemy

Why is this simple test case inserting 100,000 rows 25 times slower with SQLAlchemy than it is using the sqlite3 driver directly? I have seen similar slowdowns in real-world applications. Am I doing something wrong?

#!/usr/bin/env python
# Why is SQLAlchemy with SQLite so slow?
# Output from this program:
# SqlAlchemy: Total time for 100000 records 10.74 secs
# sqlite3:    Total time for 100000 records  0.40 secs

import time
import sqlite3

from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import scoped_session, sessionmaker

Base = declarative_base()
DBSession = scoped_session(sessionmaker())

class Customer(Base):
    __tablename__ = "customer"
    id = Column(Integer, primary_key=True)
    name = Column(String(255))

def init_sqlalchemy(dbname='sqlite:///sqlalchemy.db'):
    engine = create_engine(dbname, echo=False)
    DBSession.configure(bind=engine, autoflush=False, expire_on_commit=False)
    Base.metadata.drop_all(engine)
    Base.metadata.create_all(engine)

def test_sqlalchemy(n=100000):
    init_sqlalchemy()
    t0 = time.time()
    for i in range(n):
        customer = Customer()
        customer.name = 'NAME ' + str(i)
        DBSession.add(customer)
    DBSession.commit()
    print("SqlAlchemy: Total time for " + str(n) + " records " + str(time.time() - t0) + " secs")

def init_sqlite3(dbname):
    conn = sqlite3.connect(dbname)
    c = conn.cursor()
    c.execute("DROP TABLE IF EXISTS customer")
    c.execute("CREATE TABLE customer (id INTEGER NOT NULL, name VARCHAR(255), PRIMARY KEY(id))")
    conn.commit()
    return conn

def test_sqlite3(n=100000, dbname='sqlite3.db'):
    conn = init_sqlite3(dbname)
    c = conn.cursor()
    t0 = time.time()
    for i in range(n):
        row = ('NAME ' + str(i),)
        c.execute("INSERT INTO customer (name) VALUES (?)", row)
    conn.commit()
    print("sqlite3: Total time for " + str(n) + " records " + str(time.time() - t0) + " sec")

if __name__ == '__main__':
    test_sqlalchemy(100000)
    test_sqlite3(100000)

I have tried numerous variations (see http://pastebin.com/zCmzDraU).

620

asked Aug 02 '12 00:08

braddock

2 Answers

The SQLAlchemy ORM uses the unit of work pattern when synchronizing changes to the database. This pattern goes far beyond simple "inserts" of data. Attributes assigned on objects pass through an attribute instrumentation system that tracks changes as they are made. All inserted rows are tracked in an identity map, which means that for each row SQLAlchemy must retrieve its "last inserted id" if one was not already given. Rows to be inserted are also scanned and sorted for dependencies as needed. Objects are subject to a fair degree of bookkeeping to keep all of this running, and for a very large number of rows at once this can mean an inordinate amount of time spent on large data structures, so it's best to insert in chunks.
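The "last inserted id" cost can be seen at the DBAPI level with the standard-library sqlite3 module. This is only a rough sketch of the idea, not SQLAlchemy's actual implementation; the helper name `insert_and_collect_ids` and the in-memory database are assumptions made for illustration. To learn each generated primary key, the rows go in one statement at a time and `cursor.lastrowid` is read after every INSERT:

```python
import sqlite3


def insert_and_collect_ids(conn, names):
    """Insert rows one at a time, recording each generated primary key.

    Hypothetical helper for illustration only.
    """
    cur = conn.cursor()
    ids = []
    for name in names:
        cur.execute("INSERT INTO customer (name) VALUES (?)", (name,))
        # lastrowid is the rowid of the row inserted by the execute() above
        ids.append(cur.lastrowid)
    conn.commit()
    return ids


conn = sqlite3.connect(":memory:")  # in-memory database, assumed for the demo
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name VARCHAR(255))")
ids = insert_and_collect_ids(conn, ["NAME 0", "NAME 1", "NAME 2"])
print(ids)
```

One round trip per row is what makes it possible to pair every object with its new key, and it is also why this path cannot batch the statements.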

Basically, the unit of work automates a great deal in order to persist a complex object graph into a relational database with no explicit persistence code, and this automation has a price.

So ORMs are basically not intended for high-performance bulk inserts. This is the whole reason SQLAlchemy has two separate libraries: if you look at http://docs.sqlalchemy.org/en/latest/index.html you'll see two distinct halves to the index page, one for the ORM and one for the Core. You cannot use SQLAlchemy effectively without understanding both.

For the use case of fast bulk inserts, SQLAlchemy provides the Core, which is the SQL generation and execution system that the ORM builds on top of. Using this system effectively, we can produce an INSERT that is competitive with the raw SQLite version. The script below illustrates this, as well as an ORM version that pre-assigns primary key identifiers so that the ORM can use executemany() to insert rows. Both ORM versions also flush in chunks of 1000 records at a time, which has a significant performance impact.
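For reference, here is roughly what an executemany()-style insert looks like at the DBAPI level, using only the standard-library sqlite3 module. It is a minimal sketch under assumptions: `bulk_insert` and its `chunk_size` parameter are hypothetical names, the database is in-memory, and this is not a claim about what SQLAlchemy does internally.

```python
import sqlite3


def bulk_insert(conn, names, chunk_size=1000):
    """Insert names with executemany(), one call per chunk.

    Hypothetical helper; chunk_size is an illustrative parameter.
    """
    cur = conn.cursor()
    for start in range(0, len(names), chunk_size):
        chunk = names[start:start + chunk_size]
        # One executemany() call runs the same INSERT for every tuple in the chunk
        cur.executemany(
            "INSERT INTO customer (name) VALUES (?)",
            [(name,) for name in chunk],
        )
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database, assumed for the demo
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name VARCHAR(255))")
bulk_insert(conn, ["NAME " + str(i) for i in range(2500)])
count = conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0]
print(count)
```

The trade-off is that the generated primary keys are not handed back row by row, which is why the ORM needs them supplied up front before it can take this path.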

Runtimes observed here are:

SqlAlchemy ORM: Total time for 100000 records 16.4133379459 secs
SqlAlchemy ORM pk given: Total time for 100000 records 9.77570986748 secs
SqlAlchemy Core: Total time for 100000 records 0.568737983704 secs
sqlite3: Total time for 100000 records 0.595796823502 sec

script:

import time
import sqlite3

from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import scoped_session, sessionmaker

Base = declarative_base()
DBSession = scoped_session(sessionmaker())

class Customer(Base):
    __tablename__ = "customer"
    id = Column(Integer, primary_key=True)
    name = Column(String(255))

def init_sqlalchemy(dbname='sqlite:///sqlalchemy.db'):
    global engine
    engine = create_engine(dbname, echo=False)
    DBSession.remove()
    DBSession.configure(bind=engine, autoflush=False, expire_on_commit=False)
    Base.metadata.drop_all(engine)
    Base.metadata.create_all(engine)

def test_sqlalchemy_orm(n=100000):
    init_sqlalchemy()
    t0 = time.time()
    for i in range(n):
        customer = Customer()
        customer.name = 'NAME ' + str(i)
        DBSession.add(customer)
        # Flush in chunks of 1000 rather than all at once
        if i % 1000 == 0:
            DBSession.flush()
    DBSession.commit()
    print("SqlAlchemy ORM: Total time for " + str(n) + " records " + str(time.time() - t0) + " secs")

def test_sqlalchemy_orm_pk_given(n=100000):
    init_sqlalchemy()
    t0 = time.time()
    for i in range(n):
        # Primary key is pre-assigned, so no "last inserted id" lookup is needed
        customer = Customer(id=i+1, name="NAME " + str(i))
        DBSession.add(customer)
        if i % 1000 == 0:
            DBSession.flush()
    DBSession.commit()
    print("SqlAlchemy ORM pk given: Total time for " + str(n) + " records " + str(time.time() - t0) + " secs")

def test_sqlalchemy_core(n=100000):
    init_sqlalchemy()
    t0 = time.time()
    # A list of parameter dicts makes Core issue a single executemany()
    engine.execute(
        Customer.__table__.insert(),
        [{"name": 'NAME ' + str(i)} for i in range(n)]
    )
    print("SqlAlchemy Core: Total time for " + str(n) + " records " + str(time.time() - t0) + " secs")

def init_sqlite3(dbname):
    conn = sqlite3.connect(dbname)
    c = conn.cursor()
    c.execute("DROP TABLE IF EXISTS customer")
    c.execute("CREATE TABLE customer (id INTEGER NOT NULL, name VARCHAR(255), PRIMARY KEY(id))")
    conn.commit()
    return conn

def test_sqlite3(n=100000, dbname='sqlite3.db'):
    conn = init_sqlite3(dbname)
    c = conn.cursor()
    t0 = time.time()
    for i in range(n):
        row = ('NAME ' + str(i),)
        c.execute("INSERT INTO customer (name) VALUES (?)", row)
    conn.commit()
    print("sqlite3: Total time for " + str(n) + " records " + str(time.time() - t0) + " sec")

if __name__ == '__main__':
    test_sqlalchemy_orm(100000)
    test_sqlalchemy_orm_pk_given(100000)
    test_sqlalchemy_core(100000)
    test_sqlite3(100000)

See also: http://docs.sqlalchemy.org/en/latest/faq/performance.html

198

answered Oct 12 '22 23:10

zzzeek

Excellent answer from @zzzeek. For those wondering about the same stats for queries, I've modified @zzzeek's code slightly to query those same records right after inserting them and then convert them to a list of dicts.

Here are the results:

SqlAlchemy ORM: Total time for 100000 records 11.9210000038 secs
SqlAlchemy ORM query: Total time for 100000 records 2.94099998474 secs
SqlAlchemy ORM pk given: Total time for 100000 records 7.51800012589 secs
SqlAlchemy ORM pk given query: Total time for 100000 records 3.07699990273 secs
SqlAlchemy Core: Total time for 100000 records 0.431999921799 secs
SqlAlchemy Core query: Total time for 100000 records 0.389000177383 secs
sqlite3: Total time for 100000 records 0.459000110626 sec
sqlite3 query: Total time for 100000 records 0.103999853134 secs

It's interesting to note that querying with bare sqlite3 is still more than 3 times faster than with SQLAlchemy Core (0.10 vs. 0.39 secs). I guess that's the price you pay for having a ResultProxy returned instead of a bare sqlite3 row.

For queries, SQLAlchemy Core is about 8 times faster than the ORM (0.39 vs. 2.94 secs), so querying through the ORM is a lot slower no matter what.

Here's the code I used:

import time
import sqlite3

from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import scoped_session, sessionmaker
from sqlalchemy.sql import select

Base = declarative_base()
DBSession = scoped_session(sessionmaker())

class Customer(Base):
    __tablename__ = "customer"
    id = Column(Integer, primary_key=True)
    name = Column(String(255))

def init_sqlalchemy(dbname='sqlite:///sqlalchemy.db'):
    global engine
    engine = create_engine(dbname, echo=False)
    DBSession.remove()
    DBSession.configure(bind=engine, autoflush=False, expire_on_commit=False)
    Base.metadata.drop_all(engine)
    Base.metadata.create_all(engine)

def test_sqlalchemy_orm(n=100000):
    init_sqlalchemy()
    t0 = time.time()
    for i in range(n):
        customer = Customer()
        customer.name = 'NAME ' + str(i)
        DBSession.add(customer)
        if i % 1000 == 0:
            DBSession.flush()
    DBSession.commit()
    print("SqlAlchemy ORM: Total time for " + str(n) + " records " + str(time.time() - t0) + " secs")
    t0 = time.time()
    q = DBSession.query(Customer)
    rows = [{'id': r.id, 'name': r.name} for r in q]
    print("SqlAlchemy ORM query: Total time for " + str(len(rows)) + " records " + str(time.time() - t0) + " secs")

def test_sqlalchemy_orm_pk_given(n=100000):
    init_sqlalchemy()
    t0 = time.time()
    for i in range(n):
        customer = Customer(id=i+1, name="NAME " + str(i))
        DBSession.add(customer)
        if i % 1000 == 0:
            DBSession.flush()
    DBSession.commit()
    print("SqlAlchemy ORM pk given: Total time for " + str(n) + " records " + str(time.time() - t0) + " secs")
    t0 = time.time()
    q = DBSession.query(Customer)
    rows = [{'id': r.id, 'name': r.name} for r in q]
    print("SqlAlchemy ORM pk given query: Total time for " + str(len(rows)) + " records " + str(time.time() - t0) + " secs")

def test_sqlalchemy_core(n=100000):
    init_sqlalchemy()
    t0 = time.time()
    engine.execute(
        Customer.__table__.insert(),
        [{"name": 'NAME ' + str(i)} for i in range(n)]
    )
    print("SqlAlchemy Core: Total time for " + str(n) + " records " + str(time.time() - t0) + " secs")
    conn = engine.connect()
    t0 = time.time()
    sql = select([Customer.__table__])
    q = conn.execute(sql)
    # Columns come back in table order: r[0] is id, r[1] is name
    rows = [{'id': r[0], 'name': r[1]} for r in q]
    print("SqlAlchemy Core query: Total time for " + str(len(rows)) + " records " + str(time.time() - t0) + " secs")

def init_sqlite3(dbname):
    conn = sqlite3.connect(dbname)
    c = conn.cursor()
    c.execute("DROP TABLE IF EXISTS customer")
    c.execute("CREATE TABLE customer (id INTEGER NOT NULL, name VARCHAR(255), PRIMARY KEY(id))")
    conn.commit()
    return conn

def test_sqlite3(n=100000, dbname='sqlite3.db'):
    conn = init_sqlite3(dbname)
    c = conn.cursor()
    t0 = time.time()
    for i in range(n):
        row = ('NAME ' + str(i),)
        c.execute("INSERT INTO customer (name) VALUES (?)", row)
    conn.commit()
    print("sqlite3: Total time for " + str(n) + " records " + str(time.time() - t0) + " sec")
    t0 = time.time()
    q = conn.execute("SELECT * FROM customer").fetchall()
    rows = [{'id': r[0], 'name': r[1]} for r in q]
    print("sqlite3 query: Total time for " + str(len(rows)) + " records " + str(time.time() - t0) + " secs")

if __name__ == '__main__':
    test_sqlalchemy_orm(100000)
    test_sqlalchemy_orm_pk_given(100000)
    test_sqlalchemy_core(100000)
    test_sqlite3(100000)

I also tested without converting the query result to dicts and the stats are similar:

SqlAlchemy ORM: Total time for 100000 records 11.9189999104 secs
SqlAlchemy ORM query: Total time for 100000 records 2.78500008583 secs
SqlAlchemy ORM pk given: Total time for 100000 records 7.67199993134 secs
SqlAlchemy ORM pk given query: Total time for 100000 records 2.94000005722 secs
SqlAlchemy Core: Total time for 100000 records 0.43700003624 secs
SqlAlchemy Core query: Total time for 100000 records 0.131000041962 secs
sqlite3: Total time for 100000 records 0.500999927521 sec
sqlite3 query: Total time for 100000 records 0.0859999656677 secs

Without the dict conversion, querying with SQLAlchemy Core is about 20 times faster than with the ORM.

Note that these tests are very superficial and should not be taken too seriously. I might be missing some obvious tricks that could change the stats completely.

The best way to measure performance improvements is directly in your own application. Don't take my stats for granted.
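If you want to time sections of your own code, something along these lines may be a starting point. It's just a sketch: the `timed` context manager and the `results` dict are hypothetical names, and the sqlite3 workload in the middle is a stand-in for whatever your application actually does.

```python
import sqlite3
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock duration of the enclosed block in results[label]."""
    t0 = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - t0


results = {}
conn = sqlite3.connect(":memory:")  # stand-in workload, assumed for the demo
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name VARCHAR(255))")

with timed("insert", results):
    conn.executemany(
        "INSERT INTO customer (name) VALUES (?)",
        [("NAME " + str(i),) for i in range(10000)],
    )
    conn.commit()

with timed("query", results):
    rows = conn.execute("SELECT id, name FROM customer").fetchall()

for label, seconds in results.items():
    print("%s: %.4f secs" % (label, seconds))
```

Run it a few times and compare; a single measurement can easily be skewed by caching or disk activity.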

answered Oct 12 '22 23:10

Alex
