
"LOST" node in EMR Cluster

How do I troubleshoot and recover a LOST node in my long-running EMR cluster?

The node stopped reporting a few days ago. The host seems to be fine, and so does HDFS. I only noticed the issue in the Hadoop Applications UI.

Marsellus Wallace asked Sep 03 '15 20:09




1 Answer

EMR nodes are ephemeral, and you cannot recover them once they are marked as LOST. You can avoid this in the first place by enabling the 'Termination Protection' feature when you launch the cluster.

To find the reason a node was marked LOST, check the YARN ResourceManager logs and/or the instance controller logs of your cluster; they should tell you more about the root cause.
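Before digging through the logs, it can help to confirm which nodes YARN itself considers LOST and when they last reported. The following is a minimal sketch, not a definitive tool: it assumes the ResourceManager REST API is reachable on the master node at the default web port 8088, and the helper names `fetch_nodes` and `lost_nodes` are made up for this illustration.

```python
import json
from urllib.request import urlopen


def fetch_nodes(rm_url):
    """Query the YARN ResourceManager REST API for nodes in the LOST state.

    rm_url is an assumption, e.g. "http://<master-node-dns>:8088".
    """
    with urlopen(f"{rm_url}/ws/v1/cluster/nodes?states=LOST", timeout=10) as resp:
        return json.load(resp)


def lost_nodes(nodes_response):
    """Return (host, last_health_update_ms, health_report) for each LOST node."""
    # The "nodes" value can be null/empty when nothing matches, so guard both levels.
    nodes = (nodes_response.get("nodes") or {}).get("node") or []
    return [
        (n.get("nodeHostName"), n.get("lastHealthUpdate"), n.get("healthReport", ""))
        for n in nodes
        if n.get("state") == "LOST"
    ]
```

Calling `lost_nodes(fetch_nodes("http://<master-node-dns>:8088"))` from a machine that can reach the master node would list the affected hosts. The last health update timestamp (milliseconds since the epoch) gives a time window to search for in the ResourceManager and instance controller logs.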

annunarcist answered Oct 13 '22 02:10