
Amazon EMR - What is the need for task nodes when we have core nodes?

I have been learning about Amazon EMR lately, and as far as I know an EMR cluster lets us choose among three node types:

  1. Master, which runs the primary Hadoop daemons such as NameNode, JobTracker and ResourceManager.
  2. Core, which runs the DataNode and TaskTracker daemons.
  3. Task, which runs only the TaskTracker daemon.

My question is: why does EMR provide task nodes at all, when Hadoop suggests running the DataNode and TaskTracker daemons on the same node? What is Amazon's logic behind this? You could keep data in S3, stream it to HDFS on the core nodes, and process it there, rather than shipping data from HDFS to task nodes, which increases I/O overhead. As far as I know, in Hadoop a TaskTracker runs on the DataNode that holds the data blocks for its task, so why have TaskTrackers on separate nodes?

Taher Koitawala asked Jan 07 '17 08:01



3 Answers

According to AWS documentation [1]

The node types in Amazon EMR are as follows:

Master node: A node that manages the cluster by running software components to coordinate the distribution of data and tasks among other nodes for processing. The master node tracks the status of tasks and monitors the health of the cluster. Every cluster has a master node, and it's possible to create a single-node cluster with only the master node.

Core node: A node with software components that run tasks and store data in the Hadoop Distributed File System (HDFS) on your cluster. Multi-node clusters have at least one core node.

Task node: A node with software components that only runs tasks and does not store data in HDFS. Task nodes are optional.

According to AWS documentation [2]

Task nodes are optional. You can use them to add power to perform parallel computation tasks on data, such as Hadoop MapReduce tasks and Spark executors.

Task nodes don't run the Data Node daemon, nor do they store data in HDFS.

Some use cases are:

  • You can use task nodes to process data streamed directly from S3. Network I/O does not increase in this case, because the data being processed is not stored on HDFS.
  • Task nodes can be added or removed freely, because they run no HDFS daemons and therefore hold no data. Core nodes do run HDFS daemons, so repeatedly adding and removing them is not good practice.
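To illustrate the second point, here is a toy Python model (not EMR or Hadoop code). The `Node` class, the block IDs, and the helper name are all made up for illustration. It only shows that a node holding no HDFS blocks can leave the cluster without triggering any re-replication.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    role: str  # "core" or "task" (hypothetical labels for this sketch)
    hdfs_blocks: set = field(default_factory=set)  # block replicas stored on this node


def blocks_to_re_replicate(nodes, leaving):
    """Return the HDFS blocks that lose a replica when node `leaving` is removed."""
    node = next(n for n in nodes if n.name == leaving)
    return set(node.hdfs_blocks)


# Two core nodes hold block replicas; the task node holds none.
cluster = [
    Node("core-1", "core", {"blk_1", "blk_2"}),
    Node("core-2", "core", {"blk_2", "blk_3"}),
    Node("task-1", "task"),
]

print(blocks_to_re_replicate(cluster, "task-1"))          # set()
print(sorted(blocks_to_re_replicate(cluster, "core-1")))  # ['blk_1', 'blk_2']
```

Removing `task-1` costs nothing on the storage side, while removing `core-1` means HDFS has to copy its blocks elsewhere to restore the replication factor.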

Resources:

[1] https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-overview.html#emr-overview-clusters

[2] https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-master-core-task-nodes.html#emr-plan-task

Abdelrahman Maharek answered Sep 28 '22 08:09


Hadoop suggests running the DataNode and TaskTracker daemons on the same node because it wants the processing power to be as close to the data as possible.

But rack-level optimization also comes into play once you are dealing with a multi-node cluster. In my view, AWS reduces the I/O overhead by placing task nodes in the same rack as the DataNodes.

The reason for providing task nodes is that we often need more processing power over our data than storage for it on HDFS. We will usually want more TaskTrackers than DataNodes, so AWS gives you the option of adding them as whole nodes that still benefit from rack-level optimization.

The approach you describe for getting data into the cluster (S3 plus core nodes only) is a good option if you want good performance but are only running a transient cluster.

Hafiz Hashim answered Sep 28 '22 08:09


One use case is running task nodes on Spot Instances. If they are cheap enough, it may be worthwhile to add some extra compute power to your EMR cluster this way. This would be mostly for non-sensitive tasks.
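As a rough sketch of what that could look like, the snippet below builds an instance group list in the shape that boto3's EMR `run_job_flow` call accepts under `Instances["InstanceGroups"]`. The helper name, the group names, the `m5.xlarge` instance type, and the bid price are all assumptions chosen for illustration, and nothing here talks to AWS.

```python
def build_instance_groups(core_count, task_count, task_bid_price):
    """Build an InstanceGroups list: on-demand master and core nodes, Spot task nodes.

    The instance type and group names are illustrative assumptions.
    """
    groups = [
        {
            "Name": "master",
            "InstanceRole": "MASTER",
            "Market": "ON_DEMAND",
            "InstanceType": "m5.xlarge",
            "InstanceCount": 1,
        },
        {
            # Core nodes store HDFS data, so keep them on on-demand capacity.
            "Name": "core",
            "InstanceRole": "CORE",
            "Market": "ON_DEMAND",
            "InstanceType": "m5.xlarge",
            "InstanceCount": core_count,
        },
    ]
    if task_count > 0:
        groups.append(
            {
                # Task nodes hold no HDFS data, so losing a Spot instance loses no blocks.
                "Name": "task-spot",
                "InstanceRole": "TASK",
                "Market": "SPOT",
                "BidPrice": task_bid_price,  # passed as a string, e.g. "0.10"
                "InstanceType": "m5.xlarge",
                "InstanceCount": task_count,
            }
        )
    return groups


groups = build_instance_groups(core_count=2, task_count=4, task_bid_price="0.10")
for g in groups:
    print(g["InstanceRole"], g["Market"], g["InstanceCount"])
```

The resulting list would go into the `Instances` argument of `run_job_flow` alongside the other required cluster settings. I have not run it against a live account, so treat it as a starting point only.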

Carlos Bribiescas answered Sep 28 '22 09:09