sqlalchemy bulk update performance problems

Tags: python, sqlite, sql-update, sqlalchemy, bulk-load

I need to periodically increment the values in a column using data I receive in a file. The table has more than 400,000 rows. So far, all my attempts have resulted in very poor performance. I have written an experiment that reflects my requirements:

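from datetime import datetime
from random import randint

from sqlalchemy import (Column, Integer, MetaData, Sequence, Table,
                        bindparam, create_engine)

#create table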
engine = create_engine('sqlite:///bulk_update.db', echo=False)
metadata = MetaData()

sometable = Table('sometable',  metadata,
    Column('id', Integer, Sequence('sometable_id_seq'), primary_key=True),
    Column('column1', Integer),
    Column('column2', Integer),
)

sometable.create(engine, checkfirst=True)

#initial population
conn = engine.connect()
nr_of_rows = 50000
insert_data = [ { 'column1': i, 'column2' : 0 } for i in range(1, nr_of_rows)]
result = conn.execute(sometable.insert(), insert_data)

#update
update_data = [ {'col1' : i, '_increment': randint(1, 500)} for i in range(1, nr_of_rows)]

print("nr_of_rows %d" % nr_of_rows)
print("start time   : " + str(datetime.time(datetime.now())))

stmt = sometable.update().\
        where(sometable.c.column1 == bindparam('col1')).\
        values({sometable.c.column2 : sometable.c.column2 + bindparam('_increment')})

conn.execute(stmt, update_data)

print("end time : " + str(datetime.time(datetime.now())))

The times I get are these:

nr_of_rows 10000
start time  : 10:29:01.753938
end time    : 10:29:16.247651

nr_of_rows 50000
start time  : 10:30:35.236852
end time    : 10:36:39.070423

So updating 400,000+ rows will take far too long.

I am new to SQLAlchemy, but I have read a lot of the docs, and I just can't understand what I am doing wrong.

Thanks in advance!

asked May 24 '13 09:05

devboell

1 Answer

You are using the correct approach by doing the bulk update with a single statement.

The reason it takes so long is that the table has no index on sometable.column1. It only has the primary key index on the id column.

Your update query uses sometable.column1 in the WHERE clause to identify the record, so the database has to scan through all of the table's records for every single row it updates. With N rows to update, that is N full table scans, so the run time grows roughly quadratically. Your timings agree: 5 times as many rows took about 25 times as long (about 14 seconds versus about 6 minutes).

To make the update run much faster, change your table schema definition so that an index is created on column1, by adding index=True to the column definition:

sometable = Table('sometable',  metadata,
    Column('id', Integer, Sequence('sometable_id_seq'), primary_key=True),
    Column('column1', Integer, index=True),
    Column('column2', Integer),
)
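
One caveat worth hedging: the question's script calls sometable.create(engine, checkfirst=True) against a bulk_update.db file that may already exist, and as far as I know the create step is skipped entirely for a table that is already there, so the new index would not be added. In that case you can either delete the file and recreate the table, or create the index on its own. A minimal sketch of the second option, assuming an in-memory SQLite database as a stand-in for the real file and a hypothetical index name:

```python
from sqlalchemy import (Column, Index, Integer, MetaData, Table,
                        create_engine, inspect)

# In-memory SQLite stands in for the existing bulk_update.db file (assumption).
engine = create_engine('sqlite://')
metadata = MetaData()

# The table as originally defined, without an index on column1.
# (The Sequence from the question is left out here to keep the sketch short.)
sometable = Table('sometable', metadata,
    Column('id', Integer, primary_key=True),
    Column('column1', Integer),
    Column('column2', Integer),
)
sometable.create(engine, checkfirst=True)

# Create the index separately on the table that already exists.
# The index name is hypothetical; any unique name should do.
column1_index = Index('ix_sometable_column1', sometable.c.column1)
column1_index.create(engine)

# List the indexes the database now reports for the table.
index_names = [ix['name'] for ix in inspect(engine).get_indexes('sometable')]
print(index_names)
```

This only needs to run once per database; after that the bulk update can be executed unchanged.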

I tested the updated code on my machine: the program took less than 2 seconds to run.
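
If you want to check this on your own database rather than rely on timings, SQLite can report how it plans to execute a statement via EXPLAIN QUERY PLAN. The sketch below uses the standard library sqlite3 module directly (not SQLAlchemy) on an in-memory copy of the schema; the helper name query_plan and the index name are made up for illustration, and the exact wording of the plan text differs between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE sometable ('
             'id INTEGER PRIMARY KEY, column1 INTEGER, column2 INTEGER)')

# Same shape as the UPDATE that SQLAlchemy emits for the bulk update.
update_sql = 'UPDATE sometable SET column2 = column2 + ? WHERE column1 = ?'

def query_plan(sql, params):
    # The last column of each EXPLAIN QUERY PLAN row is the readable detail.
    rows = conn.execute('EXPLAIN QUERY PLAN ' + sql, params).fetchall()
    return ' / '.join(row[-1] for row in rows)

# Plan before the index exists: expect a full scan of sometable.
plan_without_index = query_plan(update_sql, (1, 1))

# Add the index (hypothetical name) and ask for the plan again.
conn.execute('CREATE INDEX ix_sometable_column1 ON sometable (column1)')
plan_with_index = query_plan(update_sql, (1, 1))

print(plan_without_index)
print(plan_with_index)
```

The first plan should mention a SCAN of sometable, and the second a SEARCH using the new index, which is the difference between touching every row and jumping straight to the matching one.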

BTW, kudos for your question description: you included all the code needed to reproduce the problem.

answered Sep 20 '22 11:09

vvladymyrov
