Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Differences between MapReduce and Yarn

I was searching about hadoop and mapreduce with respect to straggler problems and the papers in this problem
but yesterday I found that there is hadoop 2 with Yarn ,,
unfortunately no paper is talking about straggler problem in Yarn
So I want to know what is difference between MapReduce and Yarn in the part straggler? is Yarn suffer from straggler problem?
and when MRmaster asks resource manger for resources , resource manger will give MRmaster all resources it needs or it is according to cluster computing capabilities ?
thanks so much,,

like image 320
Flowra Avatar asked Nov 15 '14 08:11

Flowra


People also ask

What is the difference between MapReduce and YARN?

Where YARN is just a Resource manager so MapReduce is the process to distribute the data processing task and to manage the complete task. A set of resources is used in MapReduce for the complete task.

What is MapReduce and YARN in Hadoop?

MapReduce is the processing framework for processing vast data in the Hadoop cluster in a distributed manner. YARN is responsible for managing the resources amongst applications in the cluster. The HDFS daemon NameNode and YARN daemon ResourceManager run on the master node in the Hadoop cluster.

Does YARN replace MapReduce?

Is YARN a replacement of MapReduce in Hadoop? No, Yarn is the not the replacement of MR. In Hadoop v1 there were two components hdfs and MR. MR had two components for job completion cycle.


2 Answers

Here are the MapReduce 1.0 and MapReduce 2.0 (YARN)

MapReduce 1.0

In a typical Hadoop cluster, racks are interconnected via core switches. Core switches should connect to top-of-rack switches Enterprises using Hadoop should consider using 10GbE, bonded Ethernet and redundant top-of-rack switches to mitigate risk in the event of failure. A file is broken into 64MB chunks by default and distributed across Data Nodes. Each chunk has a default replication factor of 3, meaning there will be 3 copies of the data at any given time. Hadoop is “Rack Aware” and HDFS has replicated chunks on nodes on different racks. JobTracker assign tasks to nodes closest to the data depending on the location of nodes and helps the NameNode determine the ‘closest’ chunk to a client during reads. The administrator supplies a script which tells Hadoop which rack the node is in, for example: /enterprisedatacenter/rack2.

Limitations of MapReduce 1.0 – Hadoop can scale up to 4,000 nodes. When it exceeds that limit, it raises unpredictable behavior such as cascading failures and serious deterioration of overall cluster. Another issue being multi-tenancy – it is impossible to run other frameworks than MapReduce 1.0 on a Hadoop cluster.

MapReduce 2.0

MapReduce 2.0 has two components – YARN that has cluster resource management capabilities and MapReduce.

In MapReduce 2.0, the JobTracker is divided into three services:

  1. ResourceManager, a persistent YARN service that receives and runs applications on the cluster. A MapReduce job is an application.
  2. JobHistoryServer, to provide information about completed jobs
  3. Application Master, to manage each MapReduce job and is terminated when the job completes.

TaskTracker has been replaced with the NodeManager, a YARN service that manages resources and deployment on a node. NodeManager is responsible for launching containers that could either be a map or reduce task.

This new architecture breaks JobTracker model by allowing a new ResourceManager to manage resource usage across applications, with ApplicationMasters taking the responsibility of managing the execution of jobs. This change removes a bottleneck and lets Hadoop clusters scale up to larger configurations than 4000 nodes. This architecture also allows simultaneous execution of a variety of programming models such as graph processing, iterative processing, machine learning, and general cluster computing, including the traditional MapReduce.

like image 98
Snigh Avatar answered Sep 22 '22 10:09

Snigh


You say "Differences between MapReduce and YARN". MapReduce and YARN definitely different. MapReduce is Programming Model, YARN is architecture for distribution cluster. Hadoop 2 using YARN for resource management. Besides that, hadoop support programming model which support parallel processing that we known as MapReduce. Before hadoop 2, hadoop already support MapReduce. In short, MapReduce run above YARN Architecture. Sorry, i don't mention in part of straggler problem.

"when MRmaster asks resource manger for resources?" when user submit MapReduce Job. After MapReduce job has done, resource will be back to free.

"resource manger will give MRmaster all resources it needs or it is according to cluster computing capabilities" I don't get this question point. Obviously, the resources manager will give all resource it needs no matter what cluster computing capabilities. Cluster computing capabilities will influence on processing time.

like image 22
Whilda Chaq Avatar answered Sep 23 '22 10:09

Whilda Chaq