 

How to bulk fetch model objects from database handled by django/sqlalchemy

Recently I came across the following issue: how can you iterate over a really big query result in order to perform actions on every object (say, for every object create two different objects)? If you are handling a small queryset this is simple:

for obj in Mymodel.objects.all():
    create_corresponding_entries(obj)

Now try doing this with a queryset of 900k objects. Your machine will probably freeze, because it will eat up all its memory. So how can I achieve this lazily? The same question applies whether you use the Django ORM or SQLAlchemy.

asked Sep 16 '25 by John Paraskevopoulos

2 Answers

Although the Django ORM gives you a "lazy" QuerySet, what I was looking for was a generator that would let me fetch my objects lazily. QuerySets in Django are lazy only until you try to access them; at that point the database is hit and hands you the whole 1M entries. SQLAlchemy behaves the same way. If you have an Oracle or PostgreSQL database you're lucky, because you can use the supported server-side cursors. SQLAlchemy supports these too, plus MySQL if you use the mysqldb or pymysql dialects. I'm not sure how server-side cursors work behind the scenes (there is a short sketch of the SQLAlchemy approach after the links below).

More info for

  • Django ORM: https://www.niwi.nz/2012/10/22/server-side-cursors-with-postgresql-and-django/
  • SQLAlchemy: http://docs.sqlalchemy.org/en/latest/orm/query.html#sqlalchemy.orm.query.Query.yield_per
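For the SQLAlchemy case, a minimal sketch of what the yield_per approach can look like, assuming the session and Mymodel from the question are already set up (create_corresponding_entries stands for whatever per-object work you need to do):

# Sketch only: materialize ORM objects in batches of 1000 instead of all at once.
# stream_results asks the driver for a server-side cursor where the dialect supports it.
query = (session.query(Mymodel)
         .execution_options(stream_results=True)
         .yield_per(1000))
for obj in query:
    create_corresponding_entries(obj)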

So if you don't fit any of the above cases, you have to figure out a way to fetch these objects lazily. Because both the Django ORM and SQLAlchemy support slicing by translating it to pure SQL queries, I figured I could use a custom generator to slice the query into the batches I needed.

Disclaimer: this solution tries to address the problem of dumping a lot of data locally; it doesn't try to maximize query performance or anything else performance-related on the database side.

Warning: this will result in more queries to the database than a plain Mymodel.objects.all(), but it will be much easier on your RAM.

def lazy_bulk_fetch(max_obj, max_count, fetch_func, start=0):
    """Yield slices of at most max_obj objects until max_count is reached."""
    counter = start
    while counter < max_count:
        # Slicing the queryset is translated to LIMIT/OFFSET, so only this batch is fetched.
        yield fetch_func()[counter:counter + max_obj]
        counter += max_obj

and then use it, for example:

fetcher = lazy_bulk_fetch(50, Mymodel.objects.count(), lambda: Mymodel.objects.order_by('id'))
for batch in fetcher:
    make_actions(batch)

For each iteration this fetches me a list of 50 objects, until I reach the maximum count I want. If you replace make_actions(batch) with print(batch.query) in Django, you'll see something like the following:

SELECT "services_service"."id" FROM "services_service" LIMIT 50
SELECT "services_service"."id" FROM "services_service" LIMIT 50 OFFSET 50
SELECT "services_service"."id" FROM "services_service" LIMIT 50 OFFSET 100
SELECT "services_service"."id" FROM "services_service" LIMIT 50 OFFSET 150

The same concept can be used with the slicing that SQLAlchemy supports. The solution in that case would be the same, except that instead of Python slicing you would use the slice() method of SQLAlchemy's Query object.

EDIT: From what I saw, SQLAlchemy's Query class implements __getitem__, so for SQLAlchemy you could use the exact same function I suggested for Django. If you want to use the slice() method explicitly, you would end up with something like the following:

def lazy_bulk_fetch(max_obj, max_count, fetch_func, start=0):
    counter = start
    while counter < max_count:
        yield fetch_func().slice(counter, counter + max_obj)
        counter += max_obj

In either case you would call it like this:

from sqlalchemy import func
fetcher = lazy_bulk_fetch(50, session.query(func.count(Mymodel.id)).scalar(),
                          lambda: session.query(Mymodel).order_by(Mymodel.id))
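The fetcher is then consumed the same way as in the Django version; each sliced Query only hits the database when you iterate it or call .all() (a sketch, reusing make_actions from above):

for batch in fetcher:
    # .all() executes the sliced query, i.e. SELECT ... LIMIT 50 OFFSET n
    make_actions(batch.all())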

Two notes here:

  1. You want to use func.count (and .scalar() to get the actual number) so that the count is translated to a COUNT SQL statement on the server. If you instead did something like len(session.query(Mymodel).all()), you'd dump everything locally, find its length, and then throw it away.
  2. I use the lambda so that the usage mirrors the Django one. I could also have used

    lazy_bulk_fetch(50, session.query(func.count(Mymodel.id)).scalar(),
                    session.query(Mymodel).order_by(Mymodel.id))
    

    but then inside the function I would have needed

    yield fetch_func.slice(counter, counter + max_obj)
    

EDIT #2: I added ordering, since otherwise you cannot be sure you won't get the same results in the Nth run. Ordering guarantees that the batches contain unique results. It's best to use the id as the ordering key; otherwise you cannot be sure you won't miss a result (during the Nth hit a new entry might have been added, and ordering on anything other than the id could mean you miss it or get duplicate entries).

answered Sep 19 '25 by John Paraskevopoulos


I do not know whether I misunderstood your question or whether the other answer predates current versions of Django, but for Django see: https://docs.djangoproject.com/en/dev/ref/models/querysets/#iterator

for i in Mymodel.objects.iterator(chunk_size=2000):
    print(i)

As the docs explain, on some databases this is implemented with server-side cursors on the RDBMS, and on others with some tricks.
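A minimal sketch of this applied to the original task (reusing create_corresponding_entries from the question):

for obj in Mymodel.objects.iterator(chunk_size=2000):
    # rows are fetched from the driver 2000 at a time and are not cached on the queryset
    create_corresponding_entries(obj)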

answered Sep 19 '25 by Dimitrios Mistriotis