Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Apache Helix vs YARN

What is the difference between Apache Helix and Hadoop YARN (MRv2). Does anyone have experience with both technologies? Can someone explain me the advantages/disadvantages of Helix over YARN and why the LinkedIn guys developed their own cluster management instead of using YARN?

Thanks in advance Tobi

like image 959
Tobi Avatar asked May 06 '13 14:05

Tobi


1 Answers

While Helix and YARN both provide capabilities to manage distributed applications, there are important differences between the two.

YARN primarily provides resource management capabilities across a cluster of machines while requiring applications to write their custom logic to negotiate resources from the resource manager. On the other hand, Helix provides a way of declaratively managing the state of distributed applications, thus freeing the applications from having to do a custom implementation. At this time, Helix does not provide resource management capabilities in the same way as YARN. Thus the two systems are quite complementary.

As an illustration, assume you have a set of nodes and you want to start some containers on them.

  1. Allocate containers among nodes based on the resource utilization
  2. start containers,
  3. monitor container, if they die restart containers

YARN provides the framework/machinery to do the above. Once you have the containers, you have to implement the following features:

  1. Partitioning and Replication: You need to distribute tasks to containers, possibly allocate multiple tasks to each container. For redundancy you might chose to allocate a task to multiple containers.
  2. State management: manage the state of the task
  3. Fault Tolerance: When a container fails you might either chose to redistribute work among remaining containers or restart the container depending on SLA requirement.
  4. Cluster expansion: You might start new containers to handle the workload, then you want the task to be re-distributed.
  5. Throttling: During all these operations you might want to limit some operations like data movement

Helix makes it easy to achieve the above features. In YARN one needs to write the application master to achieve these (A example of such implementation is the Application master for hadoop map reduce jobs).

Helix was developed at LinkedIn to manage distributed data systems in the online/nearline space. In this space once a container is launched it runs for ever until it crashes. When a container fails, tasks might be redistributed among remaining containers.

YARN comes with resource scheduling algorithms that allows flexible and efficient utilization of the available hardware for short lived tasks like the map reduce jobs.

like image 52
Kishore G Avatar answered Nov 18 '22 00:11

Kishore G