I'm using Flask-SQLAlchemy to do a rather large bulk insert of 60k rows. I also have a many-to-many relationship on this table, so I can't use db.engine.execute for this. Before inserting, I need to find similar items in the database, and change the insert to an update if a duplicate item is found.

I could do this check beforehand and then do a bulk insert via db.engine.execute, but I need the primary key of the row upon insertion.
Currently, I am doing a db.session.add() and db.session.commit() on each insert, and I get a measly 3-4 inserts per second.

I ran a profiler to see where the bottleneck is, and it seems that db.session.commit() is taking 60% of the time.
Is there some way to make this operation faster, perhaps by grouping commits, that would still give me the primary keys back?

This is what my models look like:
class Item(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    title = db.Column(db.String(1024), nullable=True)
    created = db.Column(db.DateTime())
    tags_relationship = db.relationship('Tag', secondary=tags, backref=db.backref('items', lazy='dynamic'))
    tags = association_proxy('tags_relationship', 'text')

class Tag(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    text = db.Column(db.String(255))
My insert operation is:
for item in items:
    if duplicate:
        update_existing_item
    else:
        x = Item()
        x.title = "string"
        x.created = datetime.datetime.utcnow()
        for tag in tags:
            if not tag_already_exists:
                y = Tag()
                y.text = "tagtext"
                x.tags_relationship.append(y)
                db.session.add(y)
                db.session.commit()
            else:
                x.tags_relationship.append(existing_tag)
        db.session.add(x)
        db.session.commit()
Perhaps you should try db.session.flush() to send the data to the server, which means any primary keys will be generated. At the end you can call db.session.commit() to actually commit the transaction.
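A rough sketch of that idea against the Item and Tag models from the question; items_data and its keys are hypothetical stand-ins for the asker's input, and the duplicate-item check is left out:

import datetime

def insert_items(items_data, batch_size=1000):
    # items_data is assumed to be an iterable of dicts like
    # {"title": "...", "tags": ["tag1", "tag2"]}
    for i, data in enumerate(items_data, start=1):
        x = Item(title=data["title"], created=datetime.datetime.utcnow())
        for tag_text in data["tags"]:
            # reuse a tag that already exists in the database
            # (or is pending in the current transaction)
            tag = Tag.query.filter_by(text=tag_text).first()
            if tag is None:
                tag = Tag(text=tag_text)
                db.session.add(tag)
            x.tags_relationship.append(tag)
        db.session.add(x)
        # flush() emits the INSERTs and populates x.id, but keeps the
        # transaction open, so there is no per-row commit overhead
        db.session.flush()
        # x.id is now available for any follow-up work that needs the key
        if i % batch_size == 0:
            db.session.commit()  # commit in batches rather than per row
    db.session.commit()  # commit whatever is left over

Keep in mind that grouping commits like this means a failure rolls back the whole uncommitted batch, so pick a batch size that balances speed against how much work you are willing to redo.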
I use the following code to quickly read the content of a pandas DataFrame into SQLite. Note that it circumvents the ORM features of SQLAlchemy. myClass in this context is a db.Model derived class that has a __tablename__ assigned to it. As the comment in the snippet mentions, I adapted it from a gist.
l = df.to_dict('records')
# bulk save the dictionaries, circumventing the slow ORM interface
# c.f. https://gist.github.com/shrayasr/5df96d5bc287f3a2faa4
connection.engine.execute(
    myClass.__table__.insert(),
    l
)
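For context, a minimal sketch of how that snippet could be wired up with the Item model from the question; the DataFrame contents here are made-up illustration, and the tag relationship would still need to be populated separately because this path bypasses the ORM:

import datetime
import pandas as pd

# illustrative data; in practice the DataFrame comes from your real source
df = pd.DataFrame([
    {"title": "first item", "created": datetime.datetime.utcnow()},
    {"title": "second item", "created": datetime.datetime.utcnow()},
])

# one executemany-style INSERT for all rows, bypassing the ORM
records = df.to_dict('records')
db.engine.execute(Item.__table__.insert(), records)

Note that engine.execute() is the SQLAlchemy 1.x-style API the question already uses, and a multi-row insert like this does not hand back the generated primary keys, which is why the flush()-based approach above is the one to reach for when the ids are needed.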