
SQLAlchemy: Scan huge tables using ORM?

I am currently playing around with SQLAlchemy a bit, which is really quite neat.

For testing I created a huge table containing my pictures archive, indexed by SHA1 hashes (to remove duplicates :-)). That part was impressively fast...
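For reference, the setup looked roughly like this (a minimal sketch, not my exact code; the Picture model, its columns, and the pictures.db path are assumptions):

# Minimal sketch of the setup (SQLAlchemy 1.x declarative style; names assumed)
from sqlalchemy import create_engine, Column, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class Picture(Base):
    __tablename__ = 'pictures'
    sha1 = Column(String(40), primary_key=True)  # hex SHA1 digest of the file
    path = Column(String, nullable=False)        # where the file lives

    def __repr__(self):
        return '<Picture %s %s>' % (self.sha1, self.path)

engine = create_engine('sqlite:///pictures.db')
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)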

For fun I did the equivalent of a select * over the resulting SQLite database:

session = Session()
for p in session.query(Picture):
    print(p)

I expected to see hashes scrolling by, but instead it just kept scanning the disk. At the same time, memory usage was skyrocketing, reaching 1GB after a few seconds. This seems to come from the identity map feature of SQLAlchemy, which I thought was only keeping weak references.

Can somebody explain this to me? I thought that each Picture p would be garbage-collected after its hash was written out!?

Asked Jul 17 '09 by Bluehorn

People also ask

Is SQLAlchemy ORM slow?

SQLAlchemy is very, very fast. It's just that users tend to be unaware of just how much functionality is being delivered, and confuse an ORM result set with that of a raw database cursor.

Should I use SQLAlchemy core or ORM?

If you want to view your data in a more schema-centric view (as used in SQL), use Core. If you have data for which business objects are not needed, use Core. If you view your data as business objects, use ORM. If you are building a quick prototype, use ORM.
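To make the distinction concrete, here is a rough sketch of the same read in both styles (SQLAlchemy 1.4+ API; the pictures table and Picture model are assumed from the question above):

# Sketch: the same lookup via Core vs. the ORM (SQLAlchemy 1.4+; names assumed)
from sqlalchemy import create_engine, MetaData, Table, select

engine = create_engine('sqlite:///pictures.db')
metadata = MetaData()

# Core: schema-centric -- reflect the table and work with plain rows
pictures = Table('pictures', metadata, autoload_with=engine)
with engine.connect() as conn:
    for row in conn.execute(select(pictures.c.sha1)):
        print(row.sha1)

# ORM: object-centric -- work with mapped Picture instances instead
# session = Session()
# for p in session.query(Picture):
#     print(p.sha1)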

Is SQLAlchemy better than Django ORM?

Agree with all previous answers: yes, SQLAlchemy is really more powerful than the Django ORM, gives you real control over your SQL, and is very explicit. But 90% of our work is just simple create/read/update/delete or finding an object by id.


1 Answer

Okay, I just found a way to do this myself. Changing the code to

session = Session()
for p in session.query(Picture).yield_per(5):
    print(p)

loads only 5 pictures at a time. It seems the query loads all rows at once by default. However, I don't yet understand the disclaimer on that method. Quoting the SQLAlchemy docs:

WARNING: use this method with caution; if the same instance is present in more than one batch of rows, end-user changes to attributes will be overwritten. In particular, it’s usually impossible to use this setting with eagerly loaded collections (i.e. any lazy=False) since those collections will be cleared for a new load when encountered in a subsequent result batch.

So if using yield_per is actually the right way (tm) to scan over copious amounts of SQL data while using the ORM, when is it safe to use it?
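My working interpretation, as a sketch (the batch size and the read-only discipline are my reading of the warning, not an official rule): yield_per looks safe when the query involves no eagerly loaded collections and each row is treated as read-only.

# Sketch of what appears to be the safe pattern: a read-only scan over a
# query with no eager-loaded collections (no lazy=False relationships).
session = Session()
for p in session.query(Picture).yield_per(1000):  # batch size is a tuning knob
    # Only read attributes here; don't mutate p, and don't keep references
    # to instances from earlier batches.
    print(p.sha1)
session.close()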

Answered Sep 30 '22 by Bluehorn