
django/celery: Best practices to run tasks on 150k Django objects?

Tags: orm, django, celery
I have to run tasks on approximately 150k Django objects. What is the best way to do this? I am using the Django ORM as the broker. The database backend is MySQL, and it chokes and dies while task.delay() is being called for all of the tasks. Relatedly, I also wanted to kick this off from the submission of a form, but the resulting request had a very long response time and timed out.

Asked Sep 21 '11 by Brandon Lorenz

3 Answers

I would also consider using something other than the database as the broker; it really isn't suited to this kind of work.

That said, you can move some of this overhead out of the request/response cycle by launching a task that creates the other tasks:

from celery.task import TaskSet, task

from myapp.models import MyModel

@task
def process_object(pk):
    obj = MyModel.objects.get(pk=pk)
    # do something with obj

@task
def process_lots_of_items(ids_to_process):
    return TaskSet(process_object.subtask((id, ))
                       for id in ids_to_process).apply_async()
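With this pattern, the form view only has to enqueue a single message and can return immediately; the worker fans the work out in the background. A minimal sketch of the view side, not runnable as-is (MyForm, its "ids" field, and the "done" URL name are hypothetical, and the tasks module is assumed to contain the code above):

```python
# views.py -- a sketch; MyForm and the redirect target are assumptions.
from django.shortcuts import redirect, render

from myapp.forms import MyForm
from myapp.tasks import process_lots_of_items

def submit(request):
    if request.method == "POST":
        form = MyForm(request.POST)
        if form.is_valid():
            # One .delay() call instead of 150k of them: this returns
            # as soon as the single message is enqueued.
            process_lots_of_items.delay(form.cleaned_data["ids"])
            return redirect("done")
    else:
        form = MyForm()
    return render(request, "submit.html", {"form": form})
```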

Also, since you probably don't have 150,000 processors available to process all of these objects in parallel, you can split the objects into chunks of, say, 100 or 1000:

from itertools import islice
from celery.task import TaskSet, task
from myapp.models import MyModel

def chunks(it, n):
    for first in it:
        yield [first] + list(islice(it, n - 1))

@task
def process_chunk(pks):
    objs = MyModel.objects.filter(pk__in=pks)
    for obj in objs:
        # do something with obj

@task
def process_lots_of_items(ids_to_process):
    return TaskSet(process_chunk.subtask((chunk, ))
                       for chunk in chunks(iter(ids_to_process),
                                           1000)).apply_async()
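The chunks() helper is plain Python, so you can check its behaviour on its own. Note that it must be given an iterator (hence the iter() call above): islice() consumes from the same iterator the for loop is advancing, whereas a plain list would be re-sliced from the start on every pass. A quick standalone run:

```python
from itertools import islice

def chunks(it, n):
    # Yield successive lists of up to n items drawn from the iterator.
    for first in it:
        yield [first] + list(islice(it, n - 1))

ids = iter(range(10))
print(list(chunks(ids, 4)))  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```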
Answered Oct 04 '22 by asksol


Try using RabbitMQ instead.

RabbitMQ is used at a lot of larger companies and people really rely on it; it's a solid, well-proven broker.

Here is a great tutorial on how to get started with it.
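For Celery of the era this question comes from (2.x), pointing at RabbitMQ instead of the Django ORM is a small settings change; a sketch, assuming a default RabbitMQ install on localhost with the guest account (host, port, and vhost here are assumptions, not from the original answer):

```python
# settings.py -- broker settings for a 2.x-era Celery; values are assumptions.
BROKER_URL = "amqp://guest:guest@localhost:5672//"

# Older Celery releases used separate settings instead of a single URL:
# BROKER_HOST = "localhost"
# BROKER_PORT = 5672
# BROKER_USER = "guest"
# BROKER_PASSWORD = "guest"
# BROKER_VHOST = "/"
```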

Answered Oct 04 '22 by ApPeL


I use beanstalkd ( http://kr.github.com/beanstalkd/ ) as the engine. Adding a worker and a task is pretty straightforward in Django if you use django-beanstalkd: https://github.com/jonasvp/django-beanstalkd/

It's been very reliable in my usage.

Example worker:

import os
import time

from django_beanstalkd import beanstalk_job


@beanstalk_job
def background_counting(arg):
    """
    Do some incredibly useful counting to the value of arg
    """
    value = int(arg)
    pid = os.getpid()
    print "[%s] Counting from 1 to %d." % (pid, value)
    for i in range(1, value+1):
        print '[%s] %d' % (pid, i)
        time.sleep(1)

To launch a job/worker/task:

from django_beanstalkd import BeanstalkClient
client = BeanstalkClient()

client.call('beanstalk_example.background_counting', '5')

(source extracted from example app of django-beanstalkd)

Enjoy!

Answered Oct 04 '22 by Olivier D.