
How should I store state for a long-running process invoked from Django?

I am working on a Django application which allows a user to upload files. I need to perform some server-side processing on these files before sending them on to Amazon S3. After reading the responses to this question and this blog post, I decided that the best way to handle this is to have my view handler invoke a method on a Pyro remote object to perform the processing asynchronously and then immediately return an HTTP 200 to the client. I have this prototyped and it seems to work well; however, I would also like to store the state of the processing so that the client can poll the application to see whether the file has been processed and uploaded to S3.

I can handle the polling easily enough, but I am not sure where the appropriate location is to store the process state. It needs to be writable by the Pyro process and readable by my polling view.

  • I am hesitant to add columns to the database for data which should really only persist for 30 to 60 seconds.
  • I have considered using Django's low-level cache API with a file id as the key; however, I don't believe this is really what the cache framework is designed for, and I'm not sure what unforeseen problems there might be with going this route (a rough sketch of this option follows the list).
  • Lastly, I have considered storing state in the Pyro object doing the processing, but then it still seems like I would need to add a boolean "processing_complete" database column so that the view knows whether or not to query state from the Pyro object.
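
For reference, here is a rough sketch of what I have in mind for the cache option. The mark_status/poll_status helper names, key format, and 120-second timeout are just placeholders, and it only works if the Pyro process and the web process share a cache backend such as memcached:

    from django.core.cache import cache
    from django.http import JsonResponse

    STATUS_TTL = 120  # seconds; a bit longer than the expected 30-60 s of processing

    def mark_status(file_id, status):
        # Called from the Pyro worker: "processing" when it starts, "complete" when done.
        cache.set("upload-status-%s" % file_id, status, STATUS_TTL)

    def poll_status(request, file_id):
        # Polling view: report whatever the worker last wrote, or "unknown" if it expired.
        status = cache.get("upload-status-%s" % file_id, "unknown")
        return JsonResponse({"file_id": file_id, "status": status})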

Of course, there are also some data integrity concerns with decoupling state from the database (what happens if the server goes down and all this data is in memory?). I am eager to hear how more seasoned web application developers would handle this sort of stateful processing.

asked May 12 '09 by bouvard

2 Answers

We do this by having a "Request" table in the database.

When the upload arrives, we create the uploaded File object, and create a Request.

We start the background batch processor.

We return a 200 "we're working on it" page -- it shows the Requests and their status.

Our batch processor uses the Django ORM. When it finishes, it updates the Request object. We can (but don't) send an email notification. Mostly, we just update the status so that the user can log in again and see that processing has completed.
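
A bare-bones sketch of this setup, assuming a simple status column (the model, field, and view names here are illustrative, not our actual code):

    from django.db import models
    from django.http import JsonResponse

    class Request(models.Model):
        PENDING, DONE = "pending", "done"
        uploaded_file = models.FileField(upload_to="uploads/")
        status = models.CharField(max_length=10, default=PENDING)
        created = models.DateTimeField(auto_now_add=True)

    def request_status(request, request_id):
        # Polling view: the "we're working on it" page can hit this to refresh status.
        req = Request.objects.get(pk=request_id)
        return JsonResponse({"id": req.pk, "status": req.status})

When the batch processor finishes, it sets req.status to DONE and saves it through the same ORM.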


Batch Server Architecture notes.

It's a WSGI server that waits on a port for a batch processing request. The request is a REST POST with an ID number; the batch processor looks this up in the database and processes it.

The server is started automagically by our REST interface. If it isn't running, we spawn it. This makes a user transaction appear slow, but, oh well. It's not supposed to crash.

Also, we have a simple crontab to check that it's running. At most, it will be down for 30 minutes between "are you alive?" checks. We don't have a formal startup script (we run under Apache with mod_wsgi), but we may create a "restart" script that touches the WSGI file and then does a POST to a URL that does a health-check (and starts the batch processor).
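
Something along these lines (the schedule, URL, and use of curl are assumptions, not our actual crontab):

    # Every 30 minutes, hit the health-check URL, which restarts the batch
    # processor if it is not running.
    */30 * * * * curl -fsS http://localhost/batch/health-check/ >/dev/null 2>&1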

When the batch server starts, there may be unprocessed requests for which it has never gotten a POST. So, the default startup is to pull ALL work out of the Request queue -- assuming it may have missed something.
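
In other words, the startup path is roughly this (reusing the illustrative Request model from the earlier sketch; process_request stands in for the real worker function):

    def drain_pending_requests():
        # On startup, reprocess everything still pending, in case a
        # notification POST was missed while the server was down.
        for req in Request.objects.filter(status=Request.PENDING).order_by("created"):
            process_request(req)

    def process_request(req):
        # ... do the file processing and the S3 upload here ...
        req.status = Request.DONE
        req.save()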

answered by S.Lott


I know this is an old question but someone may find my answer useful even after all this time, so here goes.

You can, of course, use the database as a queue, but there are solutions developed exactly for this purpose.

AMQP was made just for that: use it together with Celery or Carrot and a broker server like RabbitMQ or ZeroMQ.

That's what we are using in our latest project and it is working great.

For your problem, Celery and RabbitMQ seem like the best fit. RabbitMQ provides persistence for your messages, and Celery makes it easy to expose views for polling the status of processes running in parallel.
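
As a rough illustration (the task and view names are made up, and this uses the current Celery API rather than the Carrot-era one):

    from celery import shared_task
    from celery.result import AsyncResult
    from django.http import JsonResponse

    @shared_task
    def process_upload(file_id):
        # ... server-side processing, then push the file to S3 ...
        return {"file_id": file_id, "uploaded": True}

    def start_processing(request, file_id):
        # Enqueue the work on the broker (e.g. RabbitMQ) and return immediately.
        result = process_upload.delay(file_id)
        return JsonResponse({"task_id": result.id}, status=202)

    def task_status(request, task_id):
        # Polling view: returns the task state as reported by Celery (PENDING, SUCCESS, ...).
        return JsonResponse({"task_id": task_id, "state": AsyncResult(task_id).state})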

You may also be interested in octopy.

answered by Bartosz Ptaszynski