Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

sqlalchemy bulk update performance problems

I need to increment values in a column periodically with data I receive in a file. The table has > 400000 rows. So far, all my attempts result in very poor performance. I have written an experiment that reflects my requirements:

#create table
engine = create_engine('sqlite:///bulk_update.db', echo=False)
metadata = MetaData()

sometable = Table('sometable',  metadata,
    Column('id', Integer, Sequence('sometable_id_seq'), primary_key=True),
    Column('column1', Integer),
    Column('column2', Integer),
)

sometable.create(engine, checkfirst=True)

#initial population
conn = engine.connect()
nr_of_rows = 50000
insert_data = [ { 'column1': i, 'column2' : 0 } for i in range(1, nr_of_rows)]
result = conn.execute(sometable.insert(), insert_data)

#update
update_data = [ {'col1' : i, '_increment': randint(1, 500)} for i in range(1, nr_of_rows)]

print "nr_of_rows", nr_of_rows
print "start time   : " + str(datetime.time(datetime.now()))

stmt = sometable.update().\
        where(sometable.c.column1 == bindparam('col1')).\
        values({sometable.c.column2 : sometable.c.column2 +     bindparam('_increment')})

conn.execute(stmt, update_data)

print "end time : " + str(datetime.time(datetime.now()))

the times I get are these:

nr_of_rows 10000
start time  : 10:29:01.753938
end time    : 10:29:16.247651

nr_of_rows 50000
start time  : 10:30:35.236852
end time    : 10:36:39.070423

so doing a 400000+ amount of rows will take much too long.

I am new to sqlalchemy, but I did do a lot of doc reading, and I just can't understand what I am doing wrong.

thanks in advance!

like image 720
devboell Avatar asked May 24 '13 09:05

devboell


People also ask

Is SQLAlchemy faster than sqlite?

Interesting to note that querying using bare sqlite3 is still about 3 times faster than using SQLAlchemy Core. I guess that's the price you pay for having a ResultProxy returned instead of a bare sqlite3 row. SQLAlchemy Core is about 8 times faster than using ORM. So querying using ORM is a lot slower no matter what.

Is SQLAlchemy efficient?

SQLAlchemy leverages powerful common statements and types to ensure its SQL statements are crafted efficiently and properly for each database type and vendor without you having to think about it. This makes it easy to migrate logic from Oracle to PostgreSQL or from an application database to a data warehouse.

How does SQLAlchemy update data?

Update table elements in SQLAlchemy. Get the books to table from the Metadata object initialized while connecting to the database. Pass the update query to the execute() function and get all the results using fetchall() function. Use a for loop to iterate through the results.

Does SQLAlchemy cache queries?

SQLAlchemy as of version 1.4 includes a SQL compilation caching facility which will allow Core and ORM SQL constructs to cache their stringified form, along with other structural information used to fetch results from the statement, allowing the relatively expensive string compilation process to be skipped when another ...


1 Answers

You are using the correct approach by doing bulk update with single query.

The reason why it takes that long is because the table doesn't have index on the sometable.column1. It has only primary index on column id.

Your update query uses sometable.column1 in where clause to identify record. So database has to scan through the all table records for every single column update.

To make update run much faster you need to update your table schema definition code to add index creation to the column1 definition with , index=True:

sometable = Table('sometable',  metadata,
    Column('id', Integer, Sequence('sometable_id_seq'), primary_key=True),
    Column('column1', Integer, index=True),
    Column('column2', Integer),
)

I tested updated code in my machine - it took <2 seconds for the program to run.

BTW kudos to your question description - you put all code needed to reproduce your problem.

like image 93
vvladymyrov Avatar answered Sep 20 '22 11:09

vvladymyrov