Recently I came across the following issue: how can you iterate over a really big query result in order to perform an action for each object (say, create two different objects per row)? With a small queryset this is simple:
for obj in Mymodel.objects.all():
    create_corresponding_entries(obj)
Now try doing this on a queryset with 900k objects. Your machine will probably freeze up, because the query will eat up all its memory. So how can I achieve this lazily? The same question applies whether you use the Django ORM or SQLAlchemy.
Although the Django ORM gives you a "lazy" QuerySet, what I was looking for was a generator that would let me fetch my objects lazily. QuerySets in Django are only lazy until you try to access them; at that point the database is hit and fetches you the whole 1M entries. SQLAlchemy behaves the same way. If you have an Oracle or PostgreSQL database you are lucky and can use the supported server-side cursors. SQLAlchemy supports these too, plus MySQL if you use the mysqldb or pymysql dialects. I'm not sure how server-side cursors work behind the scenes.
More info can be found in the Django and SQLAlchemy documentation on server-side cursors.
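For the SQLAlchemy side, here is a minimal sketch of what the server-side cursor route could look like (assuming a PostgreSQL database with the psycopg2 driver, and a session and Mymodel as in the examples further down; yield_per() and the stream_results execution option are what keep the full result set out of client memory):

def stream_objects(session, batch_size=1000):
    # stream_results asks the driver for a server-side cursor;
    # yield_per makes the ORM materialise only batch_size rows at a time
    query = (session.query(Mymodel)
             .execution_options(stream_results=True)
             .yield_per(batch_size))
    for obj in query:
        yield obj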
So if you don't fit any of the above cases, you have to figure out a way to fetch these objects lazily. Because both the Django ORM and SQLAlchemy support slicing by translating it into pure SQL queries, I figured I could use a custom generator to slice the batches of the query I needed.
Disclaimer: this solution tries to solve the problem of dumping a lot of data locally; it does not try to maximize query performance or anything else performance-related on the database side.
Warning: this will result in more queries to the database than a simple Mymodel.objects.all(), but it will challenge your RAM much less.
def lazy_bulk_fetch(max_obj, max_count, fetch_func, start=0):
    counter = start
    while counter < max_count:
        # slicing is translated to LIMIT/OFFSET, so only max_obj rows are fetched
        yield fetch_func()[counter:counter + max_obj]
        counter += max_obj
and then use it, for example, like this:
fetcher = lazy_bulk_fetch(50, Mymodel.objects.count(),
                          lambda: Mymodel.objects.order_by('id'))
for batch in fetcher:
    make_actions(batch)
This will fetch a list of 50 objects on each iteration until I reach the maximum count I want. If you replace make_actions(batch) with print(batch.query) in Django, you'll see something like the following:
SELECT "services_service"."id" FROM "services_service" LIMIT 50
SELECT "services_service"."id" FROM "services_service" LIMIT 50 OFFSET 50
SELECT "services_service"."id" FROM "services_service" LIMIT 50 OFFSET 100
SELECT "services_service"."id" FROM "services_service" LIMIT 50 OFFSET 150
The same concept can be used with the slice method that SQLAlchemy supports. The solution in that case would be the same, but instead of Python slicing you would use the slice function of the SQLAlchemy Query object.
EDIT: From what I saw, the SQLAlchemy Query class implements __getitem__, so for SQLAlchemy you could use the exact same function I suggested for Django. If you want to use the slice function explicitly, you would end up with something like the following:
def lazy_bulk_fetch(max_obj, max_count, fetch_func, start=0):
    counter = start
    while counter < max_count:
        yield fetch_func().slice(counter, counter + max_obj)
        counter += max_obj
In either case you would call it like this:
from sqlalchemy import func

fetcher = lazy_bulk_fetch(50,
                          session.query(func.count(Mymodel.id)).scalar(),
                          lambda: session.query(Mymodel).order_by(Mymodel.id))
Two notes here:

1. I use func.count (executed with .scalar()) so that the counting is translated to a COUNT SQL statement on the server. If you used len(session.query(Mymodel).all()) you would dump everything locally, find its length, and then throw it away.
2. I use the lambda so that the implementation mirrors the Django one. I could also have written

   lazy_bulk_fetch(50, session.query(func.count(Mymodel.id)).scalar(),
                   session.query(Mymodel).order_by(Mymodel.id))

   but then I would have to have in my function

   yield fetch_func.slice(counter, counter + max_obj)
EDIT #2: I added ordering, since otherwise you cannot be sure that you won't get the same results on the Nth run. Ordering guarantees that you will get unique results across batches. It's best to use the id as the ordering key; otherwise you cannot be sure that you won't miss a result (because during the Nth hit a new entry might have been added, and ordering without the id could make you miss it or get duplicate entries).
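A related trick, sketched here as an alternative rather than a drop-in replacement for the snippet above: once you order by id you can filter on the last id seen instead of using an ever-growing OFFSET. This keyset-style variant also avoids the missed/duplicate-entry problem just described:

def lazy_bulk_fetch_by_id(batch_size):
    # keyset-style pagination: remember the last id instead of using OFFSET
    last_id = 0
    while True:
        batch = list(Mymodel.objects.filter(id__gt=last_id)
                     .order_by('id')[:batch_size])
        if not batch:
            return
        yield batch
        last_id = batch[-1].id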
I do not know if I misunderstood your question or if the other answers predate current versions of Django, but for Django see: https://docs.djangoproject.com/en/dev/ref/models/querysets/#iterator
for i in Mymodel.objects.iterator(chunk_size=2000):
    print(i)
As the docs explain, on some databases this is implemented with server-side cursors on the RDBMS; on others it falls back to fetching the results in chunks on the client side.
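If, as in the question above, you want to act on batches rather than on single objects, you can still group what iterator() yields. A small sketch (batched is a hypothetical helper built on itertools.islice, not a Django API):

from itertools import islice

def batched(iterable, size):
    # collect a flat iterator into lists of `size` objects
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

for batch in batched(Mymodel.objects.iterator(chunk_size=2000), 50):
    make_actions(batch)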