 

Should django model object instances be passed to celery?


# models.py
from django.db import models

class Person(models.Model):
    first_name = models.CharField(max_length=30)
    last_name = models.CharField(max_length=30)
    text_blob = models.CharField(max_length=50000)


# tasks.py
import celery

@celery.task
def my_task(person):
    # example operation: does something to person
    # needs only a few of the attributes of person
    # and not the entire bulky record
    person.first_name = person.first_name.title()
    person.last_name = person.last_name.title()
    person.save()

In my application somewhere I have something like:

from models import Person
from tasks import my_task
import celery

g = celery.group([my_task.s(p) for p in Person.objects.all()])
g.apply_async()
  • Celery pickles p to send it to the worker, right?
  • If the workers are running on multiple machines, would the entire Person object (along with the bulky text_blob, which is mostly not needed) be transmitted over the network? Is there a way to avoid it?
  • How can I efficiently and evenly distribute the Person records to workers running on multiple machines?

  • Could this be a better idea? Wouldn't it overwhelm the db if Person has a few million records?

    # tasks.py
    import celery
    from models import Person

    @celery.task
    def my_task(person_pk):
        # example operation that does not need text_blob
        person = Person.objects.get(pk=person_pk)
        person.first_name = person.first_name.title()
        person.last_name = person.last_name.title()
        person.save()


    # In my application somewhere
    from models import Person
    from tasks import my_task
    import celery

    g = celery.group([my_task.s(p.pk) for p in Person.objects.all()])
    g.apply_async()
Anuvrat Parashar asked Feb 26 '13



2 Answers

I believe it is better and safer to pass the PK rather than the whole model object. Since a PK is just a number, serialization is also much simpler. Most importantly, you can use a safer serializer (JSON/YAML instead of pickle) and have peace of mind that you won't run into problems serializing your model.

As this article says:

Since Celery is a distributed system, you can't know in which process, or even on what machine, the task will run. So you shouldn't pass Django model objects as arguments to tasks; it's almost always better to re-fetch the object from the database instead, as there are possible race conditions involved.
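For illustration, here is a minimal sketch of the PK-plus-JSON approach. The app name and broker URL are placeholders, and the setting names assume Celery 4+ (older versions spell them CELERY_TASK_SERIALIZER / CELERY_ACCEPT_CONTENT):

from celery import Celery
from models import Person

app = Celery("proj", broker="redis://localhost:6379/0")  # illustrative broker URL
app.conf.task_serializer = "json"      # pickle is no longer needed once only PKs are sent
app.conf.accept_content = ["json"]

@app.task
def my_task(person_pk):
    # Re-fetch the row inside the worker; only the integer PK crosses the wire.
    person = Person.objects.get(pk=person_pk)
    person.first_name = person.first_name.title()
    person.last_name = person.last_name.title()
    person.save()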

knaperek answered Oct 15 '22


Yes. If there are millions of records in the database then this probably isn't the best approach, but since you have to go through many millions of records anyway, pretty much no matter what you do, your DB is going to get hit pretty hard.
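One small mitigation when fanning the tasks out (a sketch, not something spelled out in the original answers): pull only the primary keys, so text_blob never leaves the database and you don't materialize millions of full model instances just to build the group:

from models import Person
from tasks import my_task
import celery

# Stream just the PKs; text_blob stays in the database.
pks = Person.objects.values_list("pk", flat=True).iterator()
g = celery.group(my_task.s(pk) for pk in pks)
g.apply_async()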

Here are some alternatives, none of which I'd call "better", just different.

  1. Implement a pre_save signal handler for your Person class that does the .title() stuff (see the first sketch after this list). That way your first_name/last_name will always be stored correctly in the DB and you won't have to do this again.
  2. Use a management command that takes some kind of paging parameter... perhaps use the first letter of the last name to segment the Persons (see the second sketch after this list). So running ./manage.py my_task a would update all the records whose last name starts with "a". Obviously you'd have to run this several times to get through the whole database.
  3. Maybe you can do it with some creative SQL. I'm not even going to attempt it here, but it might be worth investigating.
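A rough sketch of option 1, assuming the Person model from the question (the receiver name is just illustrative):

# pre_save handler for option 1
from django.db.models.signals import pre_save
from django.dispatch import receiver

from models import Person

@receiver(pre_save, sender=Person)
def title_case_names(sender, instance, **kwargs):
    # Runs on every save, so new and updated rows are always stored title-cased.
    instance.first_name = instance.first_name.title()
    instance.last_name = instance.last_name.title()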
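And a rough sketch of option 2 as a management command. The command name is illustrative, and the add_arguments API assumes a reasonably recent Django (very old versions used optparse-style options instead):

# management/commands/title_names.py
from django.core.management.base import BaseCommand

from models import Person

class Command(BaseCommand):
    help = "Title-case names for Persons whose last name starts with the given letter"

    def add_arguments(self, parser):
        parser.add_argument("letter")

    def handle(self, *args, **options):
        qs = Person.objects.filter(
            last_name__istartswith=options["letter"]
        ).only("first_name", "last_name")
        for person in qs.iterator():  # avoids loading text_blob or the whole table into memory
            person.first_name = person.first_name.title()
            person.last_name = person.last_name.title()
            person.save()

# run it once per letter, e.g.:
#   ./manage.py title_names a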

Keep in mind that the .save() calls are going to be a harder "hit" on the database than actually selecting the millions of records.

Al W answered Oct 15 '22