
Clarification of use-cases for Hadoop versus RabbitMQ+Celery

I know that there are similar questions to this, such as:

  • https://stackoverflow.com/questions/8232194/pros-and-cons-of-celery-vs-disco-vs-hadoop-vs-other-distributed-computing-packag
  • Differentiate celery, kombu, PyAMQP and RabbitMQ/ironMQ

but I'm asking this because I'm looking for a more particular distinction backed by a couple of use-case examples, please.

I'm a Python user who wants to take programs that are either or both:

  1. too large to run
  2. too slow to run

on a single machine, and process them on multiple machines instead. I am familiar with Python's (single-machine) multiprocessing package, and I write mapreduce-style code right now. I know that my function, for example, is easily parallelizable.

In asking my usual smart CS advice-givers, I have phrased my question as:

"I want to take a task, split it into a bunch of subtasks that are executed simultaneously on a bunch of machines, then have those results aggregated and dealt with according to some other function, which may be a reduce, or may be instructions to serially add to a database, for example."
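As a rough sketch of that shape (all names here, such as `simulate` and `map_reduce`, are placeholders for illustration and not from any library), the split/aggregate structure can be written as a function whose map step is pluggable:

```python
from functools import reduce
from operator import add


def simulate(seed):
    """Placeholder subtask: any pure function of a single input."""
    return sum((seed * k) % 7 for k in range(1000))


def map_reduce(subtask, combine, inputs, map_impl=map):
    """Apply `subtask` to every input via `map_impl`, then fold with `combine`."""
    partial_results = map_impl(subtask, inputs)
    return reduce(combine, partial_results)


# Serial baseline using the built-in map; a parallel map can be passed instead.
total = map_reduce(simulate, add, range(8))
```

On a single machine, `map_impl` could be the `map` method of a `multiprocessing.Pool`; the question is what should play that role across machines.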

According to this break-down of my use-case, I think I could equally well use Hadoop or a set of Celery workers + RabbitMQ broker. However, when I ask the sage advice-givers, they respond to me as if I'm totally crazy to look at Hadoop and Celery as comparable solutions. I've read quite a bit about Hadoop, and also about Celery---I think I have a pretty good grasp on what both do---what I do not seem to understand is:

  1. Why are they considered so separate, so different?
  2. Given that they seem to be received as totally different technologies---in what ways? What are the use cases that distinguish one from the other or are better for one than another?
  3. What problems could be solved with both, and what areas would it be particularly foolish to use one or the other for?
  4. Are there possibly better, simpler ways to get multiprocessing-like Pool.map() functionality across multiple machines? Let's imagine my problem is constrained not by storage but by the CPU and RAM required for calculation, so there is no issue with having too little space to hold the results returned from the workers. (I.e., I'm doing something like a simulation where I need to generate a lot of things on the smaller machines, seeded by a value from a database, but these are reduced before they return to the source machine/database.)

I understand Hadoop is the big data standard, but Celery also looks well supported; I appreciate that it isn't Java (the streaming API that Python has to use for Hadoop looked uncomfortable to me), so I'd be inclined to use the Celery option.

asked Aug 29 '13 at 21:08 by Mittenchops



1 Answer

  1. They are the same in that both can solve the problem you describe (map-reduce). They are different in that Hadoop is built entirely to solve that one use case, while Celery/RabbitMQ is built to facilitate task execution on different nodes using message passing. Celery also supports other use cases.

  2. Hadoop solves the map-reduce problem with a large, special-purpose filesystem: the mappers read their data from it, the work is spread across a bunch of map nodes, and the reduced result is written back to that filesystem. The advantage is that it is really fast at doing this. The downsides are that it only operates on text-based input (at least through the streaming interface), that Python is not really supported, and that you can't do even (slightly) different use cases. Celery is a message-based task executor. In it you define tasks and group them together in a workflow (which can be a map-reduce workflow). Its advantages are that it is Python-based and that you can stitch tasks together in a custom workflow. Its disadvantages are its reliance on a single broker/result backend and its setup time.

  3. So if you have a couple of GB worth of logfiles, don't mind writing Java, and have some servers to spare that are used exclusively to run Hadoop, use that. If you want flexibility in running workflowed tasks, use Celery. Or...

  4. Yes! There is a new project from one of the companies that helped create AMQP, the messaging protocol used by RabbitMQ (and others). It is called ZeroMQ, and it takes distributed messaging/execution to the next level, strangely enough, by going down a level in abstraction compared to Celery. It defines sockets that you can link together in various ways to create messaging links between nodes. Anything you want to do with these messages is up to you to write. Although this might sound like "what good is a thin wrapper around a socket", it is actually at the right level of abstraction. Right now at our company we are factoring out all our Celery messaging and rebuilding it with ZeroMQ. We found that Celery is just too opinionated about how tasks should be executed and that the setup/config in general is a pain. Also, the broker in the middle that has to handle all traffic was becoming too much of a bottleneck.
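The "message-based task executor" model from point 2 can be illustrated without Celery itself. The sketch below is a toy, in-process stand-in and not Celery's API: a `queue.Queue` plays the broker, threads play the worker nodes, and a dict plays the result backend. The names `task`, `worker`, and `run` are made up for this example.

```python
import queue
import threading

TASKS = {}  # registry: task name -> function, filled by the decorator below


def task(fn):
    """Register a function so workers can look it up by name."""
    TASKS[fn.__name__] = fn
    return fn


@task
def square(x):
    return x * x


def worker(broker, results):
    """Pull (task_id, name, args) messages until a None sentinel arrives."""
    while True:
        message = broker.get()
        if message is None:
            break
        task_id, name, args = message
        results[task_id] = TASKS[name](*args)


def run(jobs, n_workers=3):
    """Send jobs through the queue, wait for the workers, return results in order."""
    broker, results = queue.Queue(), {}
    threads = [threading.Thread(target=worker, args=(broker, results))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for task_id, (name, args) in enumerate(jobs):
        broker.put((task_id, name, args))
    for _ in threads:
        broker.put(None)  # one stop signal per worker
    for t in threads:
        t.join()
    return [results[i] for i in range(len(jobs))]


squares = run([("square", (n,)) for n in range(5)])
```

Every message passes through the single `broker` queue, which is the "broker in the middle" that point 4 describes as a bottleneck; ZeroMQ's approach is to let nodes connect sockets to each other directly instead.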

Summary:

  • Count the occurrences of "the" in a book with as little programming as possible and lots of setup/config time: Hadoop
  • Create atomic tasks and have them work together with not too much programming and a lot of setup/config time: Celery
  • Have complete control over what to do with your messages and how to program them, with almost no setup/config time: ZeroMQ
  • Have pain with no setup/config time: Sockets
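For a sense of what the first bullet involves from Python, here is a minimal sketch of a mapper and reducer in the style Hadoop Streaming expects: lines of text in, tab-separated key/value lines out, with the reducer's input sorted by key. The function names and the sample text are illustrative, and the `sorted` call stands in for the shuffle/sort step that Hadoop would perform between the two phases.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'the<TAB>1' for every occurrence of the word 'the'."""
    for line in lines:
        for word in line.lower().split():
            if word.strip('.,;:!?"\'()') == "the":
                yield "the\t1"


def reducer(lines):
    """Sum counts per key; assumes input lines arrive sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(count) for _, count in group)}"


# Local dry run: sorted() plays the role of the shuffle/sort between phases.
book = ["The cat sat on the mat.", "Then the dog left."]
counts = list(reducer(sorted(mapper(book))))
```

In a real streaming job each function would read from `sys.stdin` and print its output lines, and the same pair can be tested locally first, as in the dry run above.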
answered Oct 11 '22 at 08:10 by RickyA