I have a Django app that allows record insertion via the Django REST Framework.
Records will be periodically batch-inserted row-by-row by client applications that interrogate spreadsheets and other databases. The REST API allows these other applications, which handle data transformation, etc, to be abstracted from Django.
I'd like to decouple the actual record insertion from the API to improve fault tolerance and the potential for scalability.
I am considering doing this with Celery, though I've not used it before. The idea is to override perform_create() (added in DRF 3.0) in my existing DRF ModelViewSets so that it creates Celery tasks, which workers would then grab and process in the background.
The DRF documentation says that perform_create() "should save the object instance by calling serializer.save()". I'm wondering whether, in my case, I could ignore this recommendation and instead have my Celery tasks call the appropriate serializer to perform the object saves.
If for example I've got a couple of models:
class Book(models.Model):
    name = models.CharField(max_length=32)

class Author(models.Model):
    surname = models.CharField(max_length=32)
And I've got DRF views and serializers for those models:
class BookSerializer(serializers.ModelSerializer):
    class Meta:
        model = Book
        fields = '__all__'

class AuthorSerializer(serializers.ModelSerializer):
    class Meta:
        model = Author
        fields = '__all__'

class BookViewSet(viewsets.ModelViewSet):
    queryset = Book.objects.all()
    serializer_class = BookSerializer

class AuthorViewSet(viewsets.ModelViewSet):
    queryset = Author.objects.all()
    serializer_class = AuthorSerializer
Would it be a good idea to override perform_create() in e.g. BookViewSet:
def perform_create(self, serializer):
    # .delay() queues the task for a worker instead of running it inline
    create_book_task.delay(serializer.data)
Where create_book_task is separately something like:
@shared_task
def create_book_task(data):
    serializer = BookSerializer(data=data)
    serializer.is_valid(raise_exception=True)
    serializer.save()
I've not really been able to find any examples of other developers doing something similar or trying to solve the same problem. Am I overcomplicating it? My database is still going to be the limiting factor when it comes to physical insertion, but at least it won't block the API clients from queueing up their data. I am not committed to Celery if it isn't suitable. Is this the best solution, are there obvious problems with it, or are there better alternatives?
Your approach is sound. Celery is great, except for some edge cases that can get a little nasty in my experience (though I wouldn't expect you to run into them in the use case you outline in the question).
However, consider the following simplified approach using Redis. It has some pros and cons.
In BookViewSet:
from redis import StrictRedis
from rest_framework import viewsets, renderers

redis_client = StrictRedis()

class BookViewSet(viewsets.ModelViewSet):
    queryset = Book.objects.all()
    serializer_class = BookSerializer

    def perform_create(self, serializer):
        # Queue the validated data in Redis instead of saving it here
        json = renderers.JSONRenderer().render(serializer.data)
        redis_client.lpush('create_book_task', json)
In a separate worker script:
from io import BytesIO

from redis import StrictRedis
from rest_framework.parsers import JSONParser

from myproject import BookSerializer, Book

MAX_BATCH_SIZE = 1000
redis_client = StrictRedis()

def create_book_task():
    bookset = []
    while len(bookset) < MAX_BATCH_SIZE:
        # brpop blocks until an item is available (or the timeout expires)
        # and returns a (key, value) tuple, or None on timeout
        item = redis_client.brpop(('create_book_task',), timeout=1)
        if item is None:
            break  # queue drained for now; flush what we have
        _, json = item
        data = JSONParser().parse(BytesIO(json))
        serializer = BookSerializer(data=data)
        serializer.is_valid(raise_exception=True)
        # Build unsaved instances so the whole batch is inserted in one query
        bookset.append(Book(**serializer.validated_data))
    if bookset:
        Book.objects.bulk_create(bookset)

while True:
    create_book_task()
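One thing the snippet glosses over: because the worker script imports your models and serializers outside of a normal Django process, Django has to be configured before those imports. A minimal sketch of the lines to put at the top of the worker script, assuming your settings module is myproject.settings (adjust to your project):
import os

import django

# Point Django at the settings module before importing models or serializers
os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'myproject.settings')
django.setup()
Alternatively, wrapping the loop in a custom management command gives you this setup for free and lets you run the worker via manage.py.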
Pros
Cons
Of course the above is only a first approach. You might want to make it more generic so it can be reused for additional models (see the sketch below), move MAX_BATCH_SIZE to your settings, use pickling instead of JSON, or make a variety of other adjustments, improvements or design decisions according to your specific needs.
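As an illustration of the "more generic" direction, here is a minimal sketch that maps each Redis list name to the serializer and model it carries. The QUEUES dict, the drain_queues() name and the create_author_task queue are all hypothetical, and it assumes the Author views push to their queue the same way BookViewSet does above:
from io import BytesIO

from redis import StrictRedis
from rest_framework.parsers import JSONParser

from myproject import Author, AuthorSerializer, Book, BookSerializer

MAX_BATCH_SIZE = 1000
redis_client = StrictRedis()

# Hypothetical mapping: Redis list name -> (serializer class, model class)
QUEUES = {
    'create_book_task': (BookSerializer, Book),
    'create_author_task': (AuthorSerializer, Author),
}

def drain_queues():
    batches = {name: [] for name in QUEUES}
    while sum(len(b) for b in batches.values()) < MAX_BATCH_SIZE:
        # Wait on all queues at once; brpop returns (queue_name, payload) or None
        item = redis_client.brpop(list(QUEUES), timeout=1)
        if item is None:
            break  # nothing waiting right now; flush what we have
        queue, payload = item
        queue = queue.decode()  # redis returns the key name as bytes
        serializer_class, model_class = QUEUES[queue]
        serializer = serializer_class(data=JSONParser().parse(BytesIO(payload)))
        serializer.is_valid(raise_exception=True)
        batches[queue].append(model_class(**serializer.validated_data))
    for queue, instances in batches.items():
        if instances:
            QUEUES[queue][1].objects.bulk_create(instances)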
In the end, I would probably go along with the approach outlined in my answer, unless you anticipate offloading several other tasks to asynchronous processing, in which case the argument for using Celery becomes much stronger.
PS: Since the actual insertion will be done asynchronously, consider responding with a 202 Accepted response code instead of 201 Created (unless this screws up your clients).
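A minimal sketch of what that could look like, assuming you keep the perform_create() override from above and only override create() to change the status code:
from redis import StrictRedis
from rest_framework import renderers, status, viewsets
from rest_framework.response import Response

from myproject import Book, BookSerializer

redis_client = StrictRedis()

class BookViewSet(viewsets.ModelViewSet):
    queryset = Book.objects.all()
    serializer_class = BookSerializer

    def perform_create(self, serializer):
        # As above: queue the record in Redis instead of saving it
        json = renderers.JSONRenderer().render(serializer.data)
        redis_client.lpush('create_book_task', json)

    def create(self, request, *args, **kwargs):
        serializer = self.get_serializer(data=request.data)
        serializer.is_valid(raise_exception=True)
        self.perform_create(serializer)
        # 202: the request was accepted, but the row has not been written yet
        return Response(serializer.data, status=status.HTTP_202_ACCEPTED)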