Architecture for distributed workers

We are creating a website able to distribute tasks across multiple geographical sites. The website should be able to:

  • create a task,
  • put it in a queue,
  • assign it to a worker depending on a geographical criterion,
  • update the web interface according to the task's progress (step 1, 2, 3, etc.),
  • save the final result in MongoDB and notify the web interface.

We can have parallel jobs running as long as they do not share the same geographical criterion.

We can delete a job as long as it is not in the processing state.

Our current stack is: AngularJS, Node.js, and MongoDB.
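
For context, a task document could look roughly like this (field names are illustrative, not a final schema):

{
    _id: ObjectId("..."),
    region: "KH",   // the geographical criterion used for assignment
    status: 1,      // e.g. 1 = pending, 2 = processing, 3 = done
    step: 0,        // current step reported to the web interface
    result: null    // final result saved when the job completes
}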

Our first idea was to have the remote workers poll the MongoDB task collection over HTTP. The problem is that we will have more than 20 remote workers and we would like a high refresh frequency (< 1s). We think this solution is easy to implement but will be difficult to maintain and will overload the DB. It is also highly dependent on network latency.
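
To make the idea concrete, the polling loop we had in mind looks roughly like this (the URL and query parameters are made up):

var http = require('http');

// Each of the 20+ remote workers would run a loop like this, hitting the
// API (and therefore MongoDB) every second, which is where the load comes from
var POLL_URL = 'http://example.com/api/tasks/next?region=KH';

setInterval(function () {
    http.get(POLL_URL, function (res) {
        var body = '';
        res.on('data', function (chunk) { body += chunk; });
        res.on('end', function () {
            var task = body ? JSON.parse(body) : null;
            if (task) {
                // process the task, then POST status updates back to the API
            }
        });
    }).on('error', console.error);
}, 1000); // < 1s refresh, multiplied by every worker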

After some research on the web, we found documentation on RabbitMQ and message systems. This seems to fit most of our requirements, but I don't see how we can delete a specific pending job from a queue, nor how we can easily handle updates to the task status.

We also found documentation about Redis, an in-memory key-value store. It solves the problem of deleting a specific task from a queue and reduces the MongoDB load, but we don't see how we would notify a remote worker of a job to do. If we fall back to HTTP polling, we lose all the benefits.

Our situation seems to be a common problem, and I would like to know what the best solution is.

asked Mar 25 '14 by Julio


2 Answers

Redis

Redis is great because you can use it for features besides job queuing, like caching. I personally use Kue. Queuing jobs across datacenters might not be the best decision, though. While I don't know your exact circumstances, it's generally accepted that your data model should be centralized whereas your content should be distributed. I run a service that hosts an API in San Francisco and has CDN nodes in San Francisco and NYC. My content is server-side templates, images, scripts, CSS, etc., all of which can be populated by my API.
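
For illustration, here is a minimal Kue sketch; the job type, fields, and someJobId are placeholders, not a full implementation:

var kue = require('kue');
var queue = kue.createQueue(); // connects to Redis on localhost by default

// Producer: create a job tagged with a region
var job = queue.create('task', { region: 'KH', payload: {} })
    .save(function (err) {
        if (!err) console.log('queued job', job.id);
    });

// Worker: process 'task' jobs
queue.process('task', function (job, done) {
    job.progress(1, 3); // report step 1 of 3 back through Redis
    // ...do the work...
    done();
});

// Remove a job that is still pending
kue.Job.get(someJobId, function (err, job) {
    if (!err) job.remove();
});

Note how job.progress() maps directly onto your "step 1, 2, 3" status updates, and removing a not-yet-started job by id covers your deletion requirement.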

Outsource

If you absolutely need this functionality, I would personally recommend iron.io. They offer two services that may solve your problem. First, they offer an MQ system through a RESTful API, which is very easy to use and works perfectly with Node. They also offer a Worker service, which allows you to queue, schedule, and run tasks on their stack. This would be limiting if you needed to access resources from your own cloud, in which case I would recommend IronMQ.

Insource

If you don't want to outsource your service and you want to host an MQ yourself, I would not recommend RabbitMQ for job queuing. I'd recommend something like beanstalkd, which is geared towards job queuing, whereas RabbitMQ is geared towards message queuing (who'd thunk?).

Additionally:

Having read some of the comments on the other answers, it seems to me that beanstalkd might be your best approach. It is specific to job queuing, whereas many other MQ systems are built to pass update messages and push new data across your cloud in realtime, so you would have to implement your own job queuing on top of them.
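
As a rough sketch of that approach with the fivebeans beanstalkd client for Node (host, port, and tube name are assumptions; check the client's docs for exact signatures):

var fivebeans = require('fivebeans');
var client = new fivebeans.client('127.0.0.1', 11300);

client
    .on('connect', function () {
        // Watch a per-region tube; one tube per region is one way to
        // enforce "no parallel jobs within the same geographical criterion"
        client.watch('tasks-kh', function (err, numwatched) {
            client.reserve(function (err, jobid, payload) {
                // reserve blocks until a job is available: no polling needed.
                // Do the work, then delete the job from the queue
                client.destroy(jobid, function (err) {});
            });
        });
    })
    .on('error', function (err) { console.error(err); })
    .connect();

The beanstalkd protocol also lets a client delete a ready (pending) job by id, which covers the "delete a job before it starts processing" requirement.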

answered Oct 22 '22 by tsturzl


RabbitMQ, Redis, and ZeroMQ are awesome, but you can do this without leaving MongoDB. There is a special type of collection, the capped collection, that allows streaming (via tailable cursors), and capped collections are both extremely fast and cheap for your database. You can have your workers (or another process) listen to the queue and then perform the tasks.

For example, imagine that you have a worker for every region and that regions are tagged with strings. Then we just need to create an internal queue to handle the updates in your main logic. We will use mongoose and async to show it:

var async = require('async');
var mongoose = require('mongoose');

// Internal queue, concurrency 1: updates are applied one at a time
var internalQueue = async.queue(function (doc, callback) {
    doc.status = 2; // Mark the task as "processing"
    doc.save(function (e) { // We update the status of the task
        // And we follow from here, doing whatever we want to do,
        // then signal the queue that this task is done
        callback(e);
    });
}, 1);

// Assuming a Task model has been registered with mongoose elsewhere
mongoose
.model('Task')
.find({
    status: 1,
    region: "KH" // Unstarted tasks from Cambodia
})
.stream()
.on('data', function (doc) {
    internalQueue.push(doc, function (e) {
        console.log('We have finished our task, alert the web interface or save the result');
    });
});

Maybe you don't want to use mongoose or async, or you want to use geoqueries or more than one worker per region, but you can do it with the tools you already have: MongoDB and Node.js.

To start working with capped collections, just use createCollection in the MongoDB shell:

db.createCollection('test', {capped: true, size: 100*1000, max: 100} )

Just remember two things:

  1. Data expires based on insertion order, not on time or last access, so make your collections big enough.
  2. You can't remove a document from a capped collection, but you can simply empty it (see the sketch below).
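
For instance, a minimal sketch in the MongoDB shell; someId and the status field are illustrative, and remember that an update on a capped collection must not grow the document:

// "Empty" a pending task instead of removing it
db.test.update({_id: someId}, {$set: {status: 0}})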
answered Oct 22 '22 by durum