Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Async spawing of processes: design question - Celery or Twisted

All: I'm seeking input/guidance/and design ideas. My goal is to find a lean but reliable way to take XML payload from an HTTP POST (no problems with this part), parse it, and spawn a relatively long-lived process asynchronously.

The spawned process is CPU intensive and will last for roughly three minutes. I don't expect much load at first, but there's a definite possibility that I will need to scale this out horizontally across servers as traffic hopefully increases.

I really like the Celery/Django stack for this use: it's very intuitive and has all of the built-in framework to accomplish exactly what I need. I started down that path with zeal, but I soon found my little 512MB RAM cloud server had only 100MB of free memory and I started sensing that I was headed for trouble once I went live with all of my processes running full-tilt. Also, it's got several moving parts: RabbitMQ, MySQL, cerleryd, ligthttpd and the django container.

I can absolutely increase the size of my server, but I'm hoping to keep my costs down to a minimum at this early phase of this project.

As an alternative, I'm considering using twisted for the process management, as well as perspective broker for the remote systems, should they be needed. But for me at least, while twisted is brilliant, I feel like I'm signing up for a lot going down that path: writing protocols, callback management, keeping track of job states, etc. The benefits here are pretty obvious - excellent performance, far fewer moving parts, and a smaller memory footprint (note: I need to verify the memory part). I'm heavily skewed toward Python for this - it's much more enjoyable for me than the alternatives :)

I'd greatly appreciate any perspective on this. I'm concerned about starting things off on the wrong track, and redoing this later with production traffic will be painful.

-Matt

like image 568
mcauth Avatar asked Jan 08 '11 17:01

mcauth


1 Answers

On my system, RabbitMQ running with pretty reasonable defaults is using about 2MB of RAM. Celeryd uses a bit more, but not an excessive amount.

In my opinion, the overhead of RabbitMQ and celery are pretty much negligible compared to the rest of the stack. If you're processing jobs that are going to take several minutes to complete, those jobs are what will overwhelm your 512MB server as soon as your traffic increases, not RabbitMQ. Starting off with RabbitMQ and Celery will at least set you up nicely to scale those jobs out horizontally though, so you're definitely on the right track there.

Sure, you could write your own job control in Twisted, but I don't see it gaining you much. Twisted has pretty good performance, but I wouldn't expect it to outperform RabbitMQ by enough to justify the time and potential for introducing bugs and architectural limitations. Mostly, it just seems like the wrong spot to worry about optimizing. Take the time that you would've spent re-writing RabbitMQ and work on reducing those three minute jobs by 20% or something. Or just spend an extra $20/month and double your capacity.

like image 54
thraxil Avatar answered Oct 28 '22 02:10

thraxil