I am currently playing around with SQLAlchemy a bit, which is really quite neat.
For testing I created a huge table containing my pictures archive, indexed by SHA1 hashes (to remove duplicates :-)), which was impressively fast...
For fun I did the equivalent of a select *
over the resulting SQLite database:
session = Session()
for p in session.query(Picture):
    print(p)
I expected to see hashes scrolling by, but instead it just kept scanning the disk. At the same time, memory usage was skyrocketing, reaching 1GB after a few seconds. This seems to come from the identity map feature of SQLAlchemy, which I thought was only keeping weak references.
Can somebody explain this to me? I thought that each Picture p would be garbage-collected after its hash is written out!?
SQLAlchemy is very, very fast. It's just that users tend to be unaware of just how much functionality is being delivered, and confuse an ORM result set with that of a raw database cursor.
If you want to view your data in a more schema-centric view (as used in SQL), use Core. If you have data for which business objects are not needed, use Core. If you view your data as business objects, use ORM. If you are building a quick prototype, use ORM.
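As a rough sketch of the Core approach (written in 1.4+ style; the pictures table, sha1 column, and database URL are just placeholders for this example), the rows come back as plain tuples, so no ORM objects or identity map are involved:

from sqlalchemy import create_engine, MetaData, Table, select

# Placeholder names; adjust to the actual schema.
engine = create_engine("sqlite:///pictures.db")
metadata = MetaData()
pictures = Table("pictures", metadata, autoload_with=engine)

with engine.connect() as conn:
    # Rows are lightweight tuples, not ORM instances, so nothing is
    # tracked in an identity map.
    for row in conn.execute(select(pictures.c.sha1)):
        print(row.sha1)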
Agree with all previous answers: yes, SQLAlchemy is really more powerful than the Django ORM; it gives you real control over your SQL and is very explicit. But 90% of our work is just simple create/read/update/delete or finding an object by id.
Okay, I just found a way to do this myself. Changing the code to
session = Session()
for p in session.query(Picture).yield_per(5):
    print(p)
loads only 5 pictures at a time. It seems the query loads all rows at once by default. However, I don't yet understand the disclaimer on that method. Quote from the SQLAlchemy docs:
WARNING: use this method with caution; if the same instance is present in more than one batch of rows, end-user changes to attributes will be overwritten. In particular, it’s usually impossible to use this setting with eagerly loaded collections (i.e. any lazy=False) since those collections will be cleared for a new load when encountered in a subsequent result batch.
So if using yield_per
is actually the right way (tm) to scan over copious amounts of SQL data while using the ORM, when is it safe to use it?
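Going by that warning, yield_per should be safe for plain read-only scans where the mapped entity has no eagerly loaded (lazy=False) collections and you don't hold references to objects across batches. A minimal sketch, assuming Picture has a sha1 column (names are illustrative):

session = Session()

# Read-only scan: Picture has no lazy=False collections here, and no
# references to the objects are kept beyond the current batch.
for p in session.query(Picture).yield_per(1000):
    print(p.sha1)

# If only the hash is needed, querying the column alone skips building
# Picture instances (and the identity map) entirely:
for (sha1,) in session.query(Picture.sha1).yield_per(1000):
    print(sha1)

session.close()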