What is the difference between Apache Helix and Hadoop YARN (MRv2)? Does anyone have experience with both technologies? Can someone explain to me the advantages and disadvantages of Helix compared to YARN, and why the LinkedIn engineers developed their own cluster management framework instead of using YARN?
Thanks in advance, Tobi
While Helix and YARN both provide capabilities to manage distributed applications, there are important differences between the two.
YARN primarily provides resource management capabilities across a cluster of machines while requiring applications to write their custom logic to negotiate resources from the resource manager. On the other hand, Helix provides a way of declaratively managing the state of distributed applications, thus freeing the applications from having to do a custom implementation. At this time, Helix does not provide resource management capabilities in the same way as YARN. Thus the two systems are quite complementary.
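To make "declaratively managing the state" a bit more concrete, here is a minimal sketch in plain Python. It is not the Helix API: the function name `compute_transitions`, the dictionary layout, and the state names are assumptions made for illustration only. The idea is that the application only describes the state it wants, and a controller works out which transitions are needed to get there from the observed state.

```python
# Toy illustration only; this is NOT Helix code or its API.
# ideal/current map: partition -> {node: state}. A missing node means OFFLINE.

def compute_transitions(ideal, current):
    """Return (partition, node, from_state, to_state) tuples that would
    move the observed state to the declared (ideal) state."""
    transitions = []
    for partition in sorted(set(ideal) | set(current)):
        want = ideal.get(partition, {})
        have = current.get(partition, {})
        for node in sorted(set(want) | set(have)):
            src = have.get(node, "OFFLINE")
            dst = want.get(node, "OFFLINE")
            if src != dst:
                transitions.append((partition, node, src, dst))
    return transitions


# Hypothetical example data: one partition, three nodes.
ideal = {"p0": {"n1": "MASTER", "n2": "SLAVE"}}
current = {"p0": {"n1": "SLAVE", "n3": "SLAVE"}}

for t in compute_transitions(ideal, current):
    print(t)
```

In this sketch the application never says how to reach the target; it only declares it. With YARN alone, the equivalent bookkeeping would live in the application's own code.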
As an illustration, assume you have a set of nodes and you want to start some containers on them.
YARN provides the framework/machinery to do the above. Once you have the containers, you have to implement the following features:
Helix makes it easy to achieve the above features. In YARN, one needs to write an ApplicationMaster to achieve these (an example of such an implementation is the ApplicationMaster for Hadoop MapReduce jobs).
Helix was developed at LinkedIn to manage distributed data systems in the online/nearline space. In this space, once a container is launched it runs forever, until it crashes. When a container fails, its tasks might be redistributed among the remaining containers.
YARN comes with resource scheduling algorithms that allow flexible and efficient utilization of the available hardware for short-lived tasks such as MapReduce jobs.
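The redistribution on failure can be pictured with a small sketch, again in plain Python and not taken from Helix. The `rebalance` function and the container and partition names are hypothetical, and simple round-robin assignment is assumed purely for illustration; a real rebalancer would normally try to move as few tasks as possible.

```python
# Toy illustration only; not Helix's rebalancing algorithm.

def rebalance(partitions, live_containers):
    """Spread partitions round-robin over the containers that are alive."""
    if not live_containers:
        raise ValueError("no live containers to assign partitions to")
    ordered = sorted(live_containers)
    assignment = {c: [] for c in ordered}
    for i, partition in enumerate(sorted(partitions)):
        assignment[ordered[i % len(ordered)]].append(partition)
    return assignment


partitions = ["p0", "p1", "p2", "p3", "p4", "p5"]

# All three long-running containers are up.
before = rebalance(partitions, {"c1", "c2", "c3"})

# Container c2 crashes; its partitions go to the survivors.
after = rebalance(partitions, {"c1", "c3"})

print(before)
print(after)
```

Note that this naive version reshuffles partitions that were not on the failed container as well, which is the kind of detail a cluster manager is expected to handle for you.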