 

How to bulk fetch model objects from database handled by django/sqlalchemy

Recently I came across the following issue: how can you iterate over a really big query result in order to perform actions on every object (say, for every object create two different objects)? If you are handling a small queryset this is simple:

for obj in Mymodel.objects.all():
    create_corresponding_entries(obj)

Now try doing this with a queryset of 900k objects. Your machine will probably freeze, because it will eat up all its memory. So how can I achieve this lazily? The same question applies whether you use the Django ORM or SQLAlchemy.

asked Sep 16 '25 by John Paraskevopoulos

2 Answers

Although the Django ORM gives you a "lazy" QuerySet, what I was looking for was a generator that would let me fetch my objects lazily. QuerySets in Django are lazy only until you try to access them; at that point the database is hit and hands you the whole 1M entries. SQLAlchemy behaves the same way. If you have an Oracle or PostgreSQL database you're lucky, because you can use the supported server-side cursors. SQLAlchemy supports these too, plus MySQL if you use the mysqldb or pymysql dialects. I'm not sure how server-side cursors work behind the scenes (there is a short sketch of the SQLAlchemy approach after the links below).

More info for

  • Django ORM: https://www.niwi.nz/2012/10/22/server-side-cursors-with-postgresql-and-django/
  • SQLAlchemy: http://docs.sqlalchemy.org/en/latest/orm/query.html#sqlalchemy.orm.query.Query.yield_per
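For the SQLAlchemy case, a minimal sketch of what the yield_per approach can look like, assuming the session and Mymodel from the question are already set up (create_corresponding_entries stands for whatever per-object work you need to do):

# Sketch only: materialize ORM objects in batches of 1000 instead of all at once.
# stream_results asks the driver for a server-side cursor where the dialect supports it.
query = (session.query(Mymodel)
         .execution_options(stream_results=True)
         .yield_per(1000))
for obj in query:
    create_corresponding_entries(obj)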

So if you don't fit any of the above cases, you have to figure out a way to fetch these objects lazily. Because both the Django ORM and SQLAlchemy support slicing by translating it to pure SQL queries, I figured I could use a custom generator to slice the query into the batches I needed.

Disclaimer: this solution tries to address the problem of dumping a lot of data locally; it doesn't try to maximize query performance or anything else performance-related on the database side.

Warning: this will result in more queries to the database than a plain Mymodel.objects.all(), but it will be much easier on your RAM.

def lazy_bulk_fetch(max_obj, max_count, fetch_func, start=0):
    """Yield slices of at most max_obj objects until max_count is reached."""
    counter = start
    while counter < max_count:
        # Slicing the queryset is translated to LIMIT/OFFSET, so only this batch is fetched.
        yield fetch_func()[counter:counter + max_obj]
        counter += max_obj

and then use it, for example:

fetcher = lazy_bulk_fetch(50, Mymodel.objects.count(), lambda: Mymodel.objects.order_by('id'))
for batch in fetcher:
    make_actions(batch)

For each iteration this fetches me a list of 50 objects, until I reach the maximum count I want. If you replace make_actions(batch) with print(batch.query) in Django, you'll see something like the following:

SELECT "services_service"."id" FROM "services_service" LIMIT 50
SELECT "services_service"."id" FROM "services_service" LIMIT 50 OFFSET 50
SELECT "services_service"."id" FROM "services_service" LIMIT 50 OFFSET 100
SELECT "services_service"."id" FROM "services_service" LIMIT 50 OFFSET 150

The same concept can be used with the slicing that SQLAlchemy supports. The solution in that case would be the same, except that instead of Python slicing you would use the slice() method of SQLAlchemy's Query object.

EDIT: From what I saw, SQLAlchemy's Query class implements __getitem__, so for SQLAlchemy you could use the exact same function I suggested for Django. If you want to use the slice() method explicitly, you would end up with something like the following:

def lazy_bulk_fetch(max_obj, max_count, fetch_func, start=0):
    counter = start
    while counter < max_count:
        yield fetch_func().slice(counter, counter + max_obj)
        counter += max_obj

In either case you would call it like this:

from sqlalchemy import func
fetcher = lazy_bulk_fetch(50, session.query(func.count(Mymodel.id)).scalar(),
                          lambda: session.query(Mymodel).order_by(Mymodel.id))
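The fetcher is then consumed the same way as in the Django version; each sliced Query only hits the database when you iterate it or call .all() (a sketch, reusing make_actions from above):

for batch in fetcher:
    # .all() executes the sliced query, i.e. SELECT ... LIMIT 50 OFFSET n
    make_actions(batch.all())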

Two notes here:

  1. You want to use func.count (and .scalar() to get the actual number) so that the count is translated to a COUNT SQL statement on the server. If you instead did something like len(session.query(Mymodel).all()), you'd dump everything locally, find its length, and then throw it away.
  2. I use the lambda so that the usage mirrors the Django one. I could also have used

    lazy_bulk_fetch(50, session.query(func.count(Mymodel.id)).scalar(),
                    session.query(Mymodel).order_by(Mymodel.id))
    

    but then inside the function I would have needed

    yield fetch_func.slice(counter, counter + max_obj)
    

EDIT #2: I added ordering, since otherwise you cannot be sure you won't get the same results in the Nth run. Ordering guarantees that the batches contain unique results. It's best to use the id as the ordering key; otherwise you cannot be sure you won't miss a result (during the Nth hit a new entry might have been added, and ordering on anything other than the id could mean you miss it or get duplicate entries).

answered Sep 19 '25 by John Paraskevopoulos


I do not know whether I misunderstood your question or whether the other answer predates current versions of Django, but for Django see: https://docs.djangoproject.com/en/dev/ref/models/querysets/#iterator

for i in Mymodel.objects.iterator(chunk_size=2000):
    print(i)

As the docs explain, on some databases this is implemented with server-side cursors on the RDBMS, and on others with some tricks.
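A minimal sketch of this applied to the original task (reusing create_corresponding_entries from the question):

for obj in Mymodel.objects.iterator(chunk_size=2000):
    # rows are fetched from the driver 2000 at a time and are not cached on the queryset
    create_corresponding_entries(obj)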

answered Sep 19 '25 by Dimitrios Mistriotis