I have a Django app that allows record insertion via the Django REST Framework.
Records will be periodically batch-inserted row-by-row by client applications that interrogate spreadsheets and other databases. The REST API allows these other applications, which handle data transformation, etc, to be abstracted from Django.
I'd like to decouple the actual record insertion from the API to improve fault tolerance and the potential for scalability.
I am considering doing this with Celery, though I've not used it before. The idea is to override perform_create() (added in DRF 3.0) in my existing DRF ModelViewSets so that it creates Celery tasks, which workers would then grab and process in the background.
The DRF documentation says that perform_create() "should save the object instance by calling serializer.save()". I'm wondering whether, in my case, I could ignore this recommendation and instead have my Celery tasks call the appropriate serializer to perform the object saves.
If for example I've got a couple of models:
class Book(models.Model):
    name = models.CharField(max_length=32)

class Author(models.Model):
    surname = models.CharField(max_length=32)
And I've got DRF views and serializers for those models:
class BookSerializer(serializers.ModelSerializer):
    class Meta:
        model = Book
        fields = '__all__'

class AuthorSerializer(serializers.ModelSerializer):
    class Meta:
        model = Author
        fields = '__all__'

class BookViewSet(viewsets.ModelViewSet):
    queryset = Book.objects.all()
    serializer_class = BookSerializer

class AuthorViewSet(viewsets.ModelViewSet):
    queryset = Author.objects.all()
    serializer_class = AuthorSerializer
Would it be a good idea to override perform_create() in e.g. BookViewSet:
def perform_create(self, serializer):
    # .delay() queues the task for a worker instead of running it inline
    create_book_task.delay(serializer.data)
Where create_book_task is separately something like:
@shared_task
def create_book_task(data):
    serializer = BookSerializer(data=data)
    serializer.is_valid(raise_exception=True)
    serializer.save()
I've not really been able to find any examples of other developers doing something similar or trying to solve the same problem. Am I overcomplicating it? My database is still going to be the limiting factor when it comes to physical insertion, but at least it won't block the API clients from queueing up their data. I am not committed to Celery if it isn't suitable. Is this the best solution, are there obvious problems with it, or are there better alternatives?
Your approach is sound. Celery is great, except for some edge cases that can get a little nasty in my experience (though I wouldn't expect you to run into them in the use case you outline in the question).
However, consider the following simplified approach using Redis. It has some pros and cons.
In BookViewSet:
from redis import StrictRedis
from rest_framework import viewsets, renderers

redis_client = StrictRedis()

class BookViewSet(viewsets.ModelViewSet):
    queryset = Book.objects.all()
    serializer_class = BookSerializer

    def perform_create(self, serializer):
        # Queue the validated data in Redis instead of saving it here
        json = renderers.JSONRenderer().render(serializer.data)
        redis_client.lpush('create_book_task', json)
In a separate worker script:
from io import BytesIO

from redis import StrictRedis
from rest_framework.parsers import JSONParser

from myproject import BookSerializer, Book

MAX_BATCH_SIZE = 1000
redis_client = StrictRedis()

def create_book_task():
    bookset = []
    while len(bookset) < MAX_BATCH_SIZE:
        # brpop blocks until an item is available (or the timeout expires)
        # and returns a (key, value) tuple, or None on timeout
        item = redis_client.brpop(('create_book_task',), timeout=1)
        if item is None:
            break  # queue drained for now; flush what we have
        _, json = item
        data = JSONParser().parse(BytesIO(json))
        serializer = BookSerializer(data=data)
        serializer.is_valid(raise_exception=True)
        # Build unsaved instances so the whole batch is inserted in one query
        bookset.append(Book(**serializer.validated_data))
    if bookset:
        Book.objects.bulk_create(bookset)

while True:
    create_book_task()
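One thing the snippet glosses over: because the worker script imports your models and serializers outside of a normal Django process, Django has to be configured before those imports. A minimal sketch of the lines to put at the top of the worker script, assuming your settings module is myproject.settings (adjust to your project):
import os

import django

# Point Django at the settings module before importing models or serializers
os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'myproject.settings')
django.setup()
Alternatively, wrapping the loop in a custom management command gives you this setup for free and lets you run the worker via manage.py.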
Pros
Cons
Of course the above is only a first approach. You might want to make it more generic so it can be reused for additional models (see the sketch below), move MAX_BATCH_SIZE to your settings, use pickling instead of JSON, or make a variety of other adjustments, improvements or design decisions according to your specific needs.
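As an illustration of the "more generic" direction, here is a minimal sketch that maps each Redis list name to the serializer and model it carries. The QUEUES dict, the drain_queues() name and the create_author_task queue are all hypothetical, and it assumes the Author views push to their queue the same way BookViewSet does above:
from io import BytesIO

from redis import StrictRedis
from rest_framework.parsers import JSONParser

from myproject import Author, AuthorSerializer, Book, BookSerializer

MAX_BATCH_SIZE = 1000
redis_client = StrictRedis()

# Hypothetical mapping: Redis list name -> (serializer class, model class)
QUEUES = {
    'create_book_task': (BookSerializer, Book),
    'create_author_task': (AuthorSerializer, Author),
}

def drain_queues():
    batches = {name: [] for name in QUEUES}
    while sum(len(b) for b in batches.values()) < MAX_BATCH_SIZE:
        # Wait on all queues at once; brpop returns (queue_name, payload) or None
        item = redis_client.brpop(list(QUEUES), timeout=1)
        if item is None:
            break  # nothing waiting right now; flush what we have
        queue, payload = item
        queue = queue.decode()  # redis returns the key name as bytes
        serializer_class, model_class = QUEUES[queue]
        serializer = serializer_class(data=JSONParser().parse(BytesIO(payload)))
        serializer.is_valid(raise_exception=True)
        batches[queue].append(model_class(**serializer.validated_data))
    for queue, instances in batches.items():
        if instances:
            QUEUES[queue][1].objects.bulk_create(instances)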
In the end, I would probably go along with the approach outlined in my answer, unless you anticipate offloading several other tasks to asynchronous processing, in which case the argument for using Celery becomes much stronger.
PS: Since the actual insertion will be done asynchronously, consider responding with a 202 Accepted response code instead of 201 Created (unless this screws up your clients).
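A minimal sketch of what that could look like, assuming you keep the perform_create() override from above and only override create() to change the status code:
from redis import StrictRedis
from rest_framework import renderers, status, viewsets
from rest_framework.response import Response

from myproject import Book, BookSerializer

redis_client = StrictRedis()

class BookViewSet(viewsets.ModelViewSet):
    queryset = Book.objects.all()
    serializer_class = BookSerializer

    def perform_create(self, serializer):
        # As above: queue the record in Redis instead of saving it
        json = renderers.JSONRenderer().render(serializer.data)
        redis_client.lpush('create_book_task', json)

    def create(self, request, *args, **kwargs):
        serializer = self.get_serializer(data=request.data)
        serializer.is_valid(raise_exception=True)
        self.perform_create(serializer)
        # 202: the request was accepted, but the row has not been written yet
        return Response(serializer.data, status=status.HTTP_202_ACCEPTED)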