YARN differs in its infrastructure layer from the original MapReduce architecture in the following way:
In YARN, the JobTracker is split into two different daemons, the ResourceManager and the NodeManager (node specific). The ResourceManager only manages the allocation of resources to the different jobs, and it contains a scheduler that just takes care of scheduling jobs without worrying about any monitoring or status updates. Different resources such as memory, CPU time, network bandwidth, etc. are bundled into one unit called a Resource Container. There are different ApplicationMasters running on different nodes, which talk to a number of these resource containers and accordingly update the NodeManager with the monitoring/status details.
I want to know how this kind of approach increases performance from the MapReduce perspective. Also, if there is any definitive content on the motivation behind YARN and its benefits over the existing implementation of MapReduce, please point me to it.
Here are some of the articles (1, 2, 3) about YARN. These talk about the benefits of using YARN.
YARN is more general than MR, and it should be possible to run other computing models like BSP besides MR. Prior to YARN, MR, BSP and the others each required a separate cluster. Now they can coexist in a single cluster, which leads to higher usage of the cluster. Here are some of the applications ported to YARN.
From a MapReduce perspective, legacy MR has separate slots for Map and Reduce tasks, but in YARN there is no fixed purpose for a container. The same container can be used for a Map task, a Reduce task, a Hama BSP task or something else. This leads to better utilization.
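As a small illustration (my own sketch, assuming the standard YARN client API from org.apache.hadoop.yarn.client): a container request is expressed only as a bundle of resources plus a priority, with no notion of a map or reduce slot. The memory/vCore values here are arbitrary.

```java
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class GenericContainerRequest {
    public static void main(String[] args) {
        // A YARN container request is expressed purely as resources: here 1024 MB and 1 vCore.
        // Nothing in the request marks it as a "map slot" or a "reduce slot".
        Resource capability = Resource.newInstance(1024, 1);
        Priority priority = Priority.newInstance(0);

        // No node/rack locality constraints; the scheduler may place the container anywhere.
        ContainerRequest request = new ContainerRequest(capability, null, null, priority);
        System.out.println("Requesting: " + request.getCapability());
    }
}
```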
Also, it makes it possible to run different versions of Hadoop in the same cluster, which is not possible with legacy MR; this makes things easier from a maintenance point of view.
Here are some additional links for YARN. Also, Hadoop: The Definitive Guide, 3rd Edition has an entire section dedicated to YARN.
FYI, it was a bit controversial to develop YARN instead of using one of the frameworks that had been doing something similar and had been running successfully for ages with their bugs ironed out.
I do not think that YARN will speed up the existing MR framework. Looking at the architecture, we can see that the system is now more modular, but modularity usually works against raw performance.
It can be claimed that YARN has nothing to do with MapReduce: MapReduce just became one of the YARN applications. You can see it as moving from an embedded program to an embedded OS with a program running within it.
At the same time, YARN opens the door for different MR implementations built on different frameworks. For example, if we assume that our dataset is smaller than the cluster memory, we can get much better performance. I think http://www.spark-project.org/ is one such example.
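As a toy illustration of that point (my own sketch, not taken from the Spark docs; the application name and HDFS path are hypothetical), caching a dataset in cluster memory lets later passes skip re-reading HDFS:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class InMemoryPasses {
    public static void main(String[] args) {
        // Assumes an existing Spark deployment (e.g. on YARN); the input path is hypothetical.
        SparkConf conf = new SparkConf().setAppName("in-memory-passes");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile("hdfs:///data/input.txt");
        lines.cache(); // keep the RDD in cluster memory after the first pass

        long total = lines.count();                                      // first pass reads from HDFS
        long nonEmpty = lines.filter(s -> !s.trim().isEmpty()).count();  // second pass served from memory

        System.out.println(total + " lines, " + nonEmpty + " non-empty");
        sc.stop();
    }
}
```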
To summarize: YARN does not improve the existing MR, but it enables other MR implementations that can be better in all aspects.
Let's look at the Hadoop 1.0 drawbacks that have been addressed by Hadoop 2.0 with the addition of YARN. The single JobTracker was a bottleneck, and it has been removed with the YARN architecture in Hadoop 2.x.
The fundamental idea of YARN is to split up the functionalities of resource management and job scheduling/monitoring into separate daemons. The idea is to have a global ResourceManager (RM) and per-application ApplicationMaster (AM). An application is either a single job or a DAG of jobs.
The ResourceManager has two main components: Scheduler and ApplicationsManager.
The Scheduler is responsible for allocating resources to the various running applications, subject to the familiar constraints of capacities, queues, etc. The Scheduler is a pure scheduler in the sense that it performs no monitoring or tracking of status for the applications.
The ApplicationsManager is responsible for accepting job submissions, negotiating the first container for executing the application-specific ApplicationMaster, and providing the service for restarting the ApplicationMaster container on failure.
The per-application ApplicationMaster has the responsibility of negotiating appropriate resource containers from the Scheduler, tracking their status and monitoring progress.
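Putting those pieces together, here is a minimal sketch of an ApplicationMaster's core loop using the public AMRMClient API (host, port, resource sizes and the polling interval are placeholder values, and error handling is omitted): register with the ResourceManager, ask the Scheduler for a container, poll for the allocation, and unregister when done.

```java
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class MiniAppMaster {
    public static void main(String[] args) throws Exception {
        AMRMClient<ContainerRequest> rm = AMRMClient.createAMRMClient();
        rm.init(new Configuration());
        rm.start();

        // 1. Register this ApplicationMaster with the ResourceManager.
        rm.registerApplicationMaster("appmaster-host", 0, ""); // placeholder host/port/tracking URL

        // 2. Negotiate a container from the Scheduler (resources only, no task type).
        rm.addContainerRequest(new ContainerRequest(
                Resource.newInstance(1024, 1), null, null, Priority.newInstance(0)));

        // 3. Poll allocate() until the Scheduler grants the container.
        List<Container> granted = Collections.emptyList();
        while (granted.isEmpty()) {
            AllocateResponse response = rm.allocate(0.1f); // heartbeat with a progress report
            granted = response.getAllocatedContainers();
            Thread.sleep(1000);
        }
        // ... launch work in the granted container via NMClient, then track/monitor it ...

        // 4. Tell the ResourceManager we are done.
        rm.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "done", null);
        rm.stop();
    }
}
```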
Now, the advantages of YARN:
All of the above answers cover a lot of information; I am simplifying it as follows:
MapReduce vs. YARN:
1. MapReduce is a platform plus application in Hadoop 1.0, and only one of the applications in Hadoop 2.0; YARN is a platform in Hadoop 2.0 and doesn't exist in Hadoop 1.0.
2. MapReduce is a single-use system, i.e., we can run MapReduce jobs only; YARN is a multi-purpose system on which we can run MapReduce, Spark, Tez, Flink, BSP, MPP, MPI, Giraph, etc. (general purpose).
3. In MapReduce, the JobTracker limits scalability because it handles both resource management and job management; in YARN, resource management and application management are separated and handled by the RM+NM and paradigm-specific AMs respectively.
4. MapReduce has a poor resource-management system, i.e., fixed map/reduce slots; YARN has flexible resource management, i.e., containers.
5. MapReduce is not highly available; YARN provides high availability and reliability.
6. MapReduce scaled out up to 5,000 nodes; YARN scales out to 10,000-plus nodes.
7. In MapReduce, a Job -> tasks; in YARN, an Application -> DAG of jobs -> tasks.
8. Classical MapReduce = MapReduce API + MapReduce framework + MapReduce system; YARN MapReduce = MapReduce API + MapReduce framework + YARN system.
So MR programs that were written for Hadoop 1.0 also run on YARN without changing a single line of code, i.e., backward compatibility.
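To illustrate the backward-compatibility point in item 8 (my own sketch, not part of the original answer): the standard WordCount job is written purely against the MapReduce API, so the same source runs on a YARN cluster without modification; the input and output paths are supplied on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// The familiar WordCount job: the same source runs on classic MR1 or on YARN.
public class WordCount {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE); // emit (word, 1) for each token
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum)); // emit (word, total count)
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. hdfs:///data/in  (hypothetical)
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. hdfs:///data/out (hypothetical)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```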