How to decrease heartbeat time of slave nodes in Hadoop

Tags: java, amazon-web-services, hadoop, mapreduce, amazon-emr

I am working on AWS EMR.

I want to find out about a dead task node as soon as possible. But with the default Hadoop settings, a dead node is only detected after 10 minutes.

This is the default key-value pair in mapred-default.xml: mapreduce.jobtracker.expire.trackers.interval = 600000 ms

I tried to change the default value to 6000 ms, following this link.

After that, whenever I terminate an EC2 instance in the EMR cluster, the node's state does not change that quickly (within 6 seconds).

ResourceManager REST API: http://MASTER_DNS_NAME:8088/ws/v1/cluster/nodes
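To watch for state changes without refreshing the UI, you can poll that endpoint. Below is a minimal sketch, assuming Java 11+ (for java.net.http) and assuming that each node object in the response carries exactly one "id" and one "state" field, as in the documented response format. The names RmNodeStates, fetchNodes and parseNodeStates are hypothetical, and the regex is only a stand-in for a real JSON parser such as Jackson.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Polls the ResourceManager nodes endpoint and maps each node id to its current state.
// Sketch only: a regex stands in for a JSON library to keep it dependency-free.
class RmNodeStates {

    // Matches "id":"..." and "state":"..." pairs anywhere in the response body.
    private static final Pattern FIELD = Pattern.compile("\"(id|state)\"\\s*:\\s*\"([^\"]*)\"");

    // Fetches the raw JSON from http://<master>:8088/ws/v1/cluster/nodes.
    static String fetchNodes(String masterDns) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(
                        URI.create("http://" + masterDns + ":8088/ws/v1/cluster/nodes"))
                .header("Accept", "application/json")
                .GET()
                .build();
        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    // Pairs up id and state fields in document order. Assumes each node object has exactly
    // one of each and that no nested object reuses those field names.
    static Map<String, String> parseNodeStates(String json) {
        Map<String, String> states = new LinkedHashMap<>();
        Matcher m = FIELD.matcher(json);
        String id = null;
        String state = null;
        while (m.find()) {
            if ("id".equals(m.group(1))) {
                id = m.group(2);
            } else {
                state = m.group(2);
            }
            if (id != null && state != null) {
                states.put(id, state);
                id = null;
                state = null;
            }
        }
        return states;
    }

    public static void main(String[] args) throws Exception {
        String master = args.length > 0 ? args[0] : "MASTER_DNS_NAME";
        parseNodeStates(fetchNodes(master))
                .forEach((id, state) -> System.out.println(id + " -> " + state));
    }
}
```

Run it with the master node's DNS name as the first argument; each line shows a node id and its state (RUNNING, DECOMMISSIONING, LOST, ...).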

Questions:

1. What is the command to check the value of mapreduce.jobtracker.expire.trackers.interval in a running EMR (Hadoop) cluster?
2. Is this the right key for getting a faster state change? If not, please suggest another solution.
3. What is the difference between the DECOMMISSIONING, DECOMMISSIONED and LOST node states in the ResourceManager UI?

Update

I tried this a number of times, but the behaviour is inconsistent. Sometimes the node moves to the DECOMMISSIONING/DECOMMISSIONED state, and sometimes it moves directly to the LOST state after 10 minutes.

I need the state to change quickly so that I can trigger an event.
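One way to trigger that event is to compare successive snapshots of the node states and react to every node that has left RUNNING. A minimal sketch, assuming each snapshot maps a node id to its state string as reported by the ResourceManager REST API; NodeStateWatcher, Transition, leftRunning and the MISSING marker are hypothetical names. A node that disappears from the response entirely is also reported, since it is an assumption here that inactive nodes always stay listed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Detects nodes that left the RUNNING state between two polls of the nodes API.
class NodeStateWatcher {

    // One observed change, e.g. "ip-10-0-0-2:8041: RUNNING -> LOST".
    static final class Transition {
        final String nodeId;
        final String from;
        final String to;

        Transition(String nodeId, String from, String to) {
            this.nodeId = nodeId;
            this.from = from;
            this.to = to;
        }

        @Override
        public String toString() {
            return nodeId + ": " + from + " -> " + to;
        }
    }

    // Returns every node that was RUNNING in `previous` and is now in another state,
    // or is absent from `current` (reported with the hypothetical state "MISSING").
    static List<Transition> leftRunning(Map<String, String> previous, Map<String, String> current) {
        List<Transition> out = new ArrayList<>();
        for (Map.Entry<String, String> e : previous.entrySet()) {
            if (!"RUNNING".equals(e.getValue())) {
                continue;
            }
            String now = current.getOrDefault(e.getKey(), "MISSING");
            if (!"RUNNING".equals(now)) {
                out.add(new Transition(e.getKey(), "RUNNING", now));
            }
        }
        return out;
    }
}
```

Call leftRunning after each poll and fire your event once per returned transition. Note that the poll interval adds to whatever delay the ResourceManager itself needs before it changes the state.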

Here is my sample code:

import com.amazonaws.services.elasticmapreduce.model.Configuration;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Configuration classifications applied by EMR when the cluster is created
List<Configuration> configurations = new ArrayList<Configuration>();

// mapred-site: interval (ms) after which the MRv1 JobTracker declares a task tracker lost
Configuration mapredSiteConfiguration = new Configuration();
mapredSiteConfiguration.setClassification("mapred-site");
Map<String, String> mapredSiteConfigurationMapper = new HashMap<String, String>();
mapredSiteConfigurationMapper.put("mapreduce.jobtracker.expire.trackers.interval", "7000");
mapredSiteConfiguration.setProperties(mapredSiteConfigurationMapper);

// hdfs-site: how often (in seconds) the NameNode checks whether decommissioning is complete
Configuration hdfsSiteConfiguration = new Configuration();
hdfsSiteConfiguration.setClassification("hdfs-site");
Map<String, String> hdfsSiteConfigurationMapper = new HashMap<String, String>();
hdfsSiteConfigurationMapper.put("dfs.namenode.decommission.interval", "10");
hdfsSiteConfiguration.setProperties(hdfsSiteConfigurationMapper);

// yarn-site: interval (ms) between NodeManager heartbeats to the ResourceManager
Configuration yarnSiteConfiguration = new Configuration();
yarnSiteConfiguration.setClassification("yarn-site");
Map<String, String> yarnSiteConfigurationMapper = new HashMap<String, String>();
yarnSiteConfigurationMapper.put("yarn.resourcemanager.nodemanagers.heartbeat-interval-ms", "5000");
yarnSiteConfiguration.setProperties(yarnSiteConfigurationMapper);

configurations.add(mapredSiteConfiguration);
configurations.add(hdfsSiteConfiguration);
configurations.add(yarnSiteConfiguration);

These are the settings I changed in AWS EMR (Hadoop underneath) to reduce the time it takes for a node to change from RUNNING to another state (DECOMMISSIONING/DECOMMISSIONED/LOST).


asked Aug 14 '16 20:08

devsda

1 Answer

You can use "hdfs getconf" (for example, hdfs getconf -confKey <key>). Please refer to this post: Get a yarn configuration from commandline.
These links describe the NodeManager health check and the properties you should check:
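If you would rather read the value directly, the site files are plain XML. A minimal sketch that looks up one property; it assumes EMR keeps the Hadoop site files under /etc/hadoop/conf (adjust the path otherwise), and HadoopConfLookup, parse and lookup are hypothetical names. A site file only shows overrides: if the key is absent, the value from the matching *-default.xml applies. Hadoop daemons also serve their effective configuration in the same XML layout at a /conf page (for the ResourceManager, presumably http://MASTER_DNS_NAME:8088/conf; verify on your version), so passing that URL instead of a path should work for YARN keys.

```java
import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Looks up one property in Hadoop configuration XML, i.e. the
// <configuration><property><name/><value/></property></configuration> layout.
class HadoopConfLookup {

    // Parses XML from a file, URL or in-memory reader, wrapping checked exceptions.
    static Document parse(InputSource source) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(source);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse configuration XML", e);
        }
    }

    // Returns the value set for `key`, or null when this XML does not set it.
    static String lookup(Document doc, String key) {
        NodeList props = doc.getElementsByTagName("property");
        for (int i = 0; i < props.getLength(); i++) {
            Element p = (Element) props.item(i);
            NodeList names = p.getElementsByTagName("name");
            NodeList values = p.getElementsByTagName("value");
            if (names.getLength() > 0 && key.equals(names.item(0).getTextContent().trim())) {
                return values.getLength() > 0 ? values.item(0).getTextContent().trim() : "";
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Default path and key are assumptions; pass your own as arguments.
        String where = args.length > 0 ? args[0] : "/etc/hadoop/conf/mapred-site.xml";
        String key = args.length > 1 ? args[1] : "mapreduce.jobtracker.expire.trackers.interval";
        // Accept either a URL (such as a daemon's /conf page) or a local file path.
        String systemId = where.startsWith("http://") || where.startsWith("https://")
                ? where : new File(where).toURI().toString();
        String value = lookup(parse(new InputSource(systemId)), key);
        System.out.println(key + " = "
                + (value != null ? value : "(not set here; the *-default.xml value applies)"));
    }
}
```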

https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ClusterSetup.html

https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/NodeManager.html

See "yarn.resourcemanager.nodemanagers.heartbeat-interval-ms" at the link below:

https://hadoop.apache.org/docs/r2.7.1/hadoop-yarn/hadoop-yarn-common/yarn-default.xml

Your questions are answered in this JIRA issue:

https://issues.apache.org/jira/browse/YARN-914

See its "Attachments" and "Sub-Tasks" sections. In short, if the application master and task containers running on a node are shut down properly (and/or restarted on other nodes), the NodeManager is said to be DECOMMISSIONED (gracefully); otherwise it is LOST. DECOMMISSIONING is the intermediate state while such a graceful decommission is still in progress.
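If the event you trigger should depend on how a node left, the three states can be encoded as a small lookup. In this sketch DECOMMISSIONING is treated as the in-progress phase of a graceful decommission (the feature tracked by YARN-914). NodeExit, fromState and its fields are hypothetical names rather than YARN classes, although the state names match YARN's, and the descriptions paraphrase this answer.

```java
// Summarizes the DECOMMISSIONING / DECOMMISSIONED / LOST node states.
// Hypothetical helper, not part of YARN; only the constant names come from YARN.
enum NodeExit {
    DECOMMISSIONING(true, true, "graceful decommission in progress; running work is being wound down"),
    DECOMMISSIONED(true, false, "left gracefully; its containers were shut down or moved to other nodes"),
    LOST(false, false, "stopped heartbeating past the expiry interval; removed without a clean shutdown");

    final boolean graceful;    // containers were (or are being) shut down cleanly
    final boolean inProgress;  // the node is still in the process of leaving
    final String meaning;

    NodeExit(boolean graceful, boolean inProgress, String meaning) {
        this.graceful = graceful;
        this.inProgress = inProgress;
        this.meaning = meaning;
    }

    // Maps a state string from the RM REST API or UI; returns null for other states such as RUNNING.
    static NodeExit fromState(String state) {
        for (NodeExit e : values()) {
            if (e.name().equals(state)) {
                return e;
            }
        }
        return null;
    }
}
```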

Update:

"dfs.namenode.decommission.interval" applies to HDFS DataNode decommissioning, so it does not matter if you are only concerned with the NodeManager. In exceptional cases, a DataNode need not also be a compute node.

Try yarn.nm.liveness-monitor.expiry-interval-ms instead of mapreduce.jobtracker.expire.trackers.interval; the mapreduce.jobtracker.* properties configure the MRv1 JobTracker, which a YARN cluster does not run. Its default is 600000 ms (10 minutes), which is why you saw the state change to LOST after 10 minutes. Set it to a smaller value as required.
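On EMR this property belongs in the yarn-site classification, like the heartbeat setting in the question's code. A sketch that builds the configurations JSON accepted by aws emr create-cluster --configurations; EmrYarnConfig and toConfigurationsJson are hypothetical names, and 30000 ms is only an example value, not a recommendation.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Builds an EMR configurations JSON document for a single classification.
class EmrYarnConfig {

    static String toConfigurationsJson(String classification, Map<String, String> properties) {
        StringJoiner props = new StringJoiner(",", "{", "}");
        properties.forEach((k, v) -> props.add(quote(k) + ":" + quote(v)));
        return "[{\"Classification\":" + quote(classification) + ",\"Properties\":" + props + "}]";
    }

    // Minimal JSON string quoting; enough for Hadoop property names and numeric values.
    private static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    public static void main(String[] args) {
        Map<String, String> yarnSite = new LinkedHashMap<>();
        // Example value only: mark a NodeManager LOST after 30 s without a heartbeat.
        yarnSite.put("yarn.nm.liveness-monitor.expiry-interval-ms", "30000");
        System.out.println(toConfigurationsJson("yarn-site", yarnSite));
    }
}
```

Save the printed JSON as configurations.json and pass it as --configurations file://configurations.json. With the AWS SDK, put the same key into the yarn-site Configuration instead of the mapred-site one.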

You have set "yarn.resourcemanager.nodemanagers.heartbeat-interval-ms" to 5000, which means each NodeManager sends a heartbeat to the ResourceManager only once every 5 seconds, whereas the default is 1000 ms. Set it to a smaller value as required.
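Lowering the expiry only helps if healthy nodes still get several heartbeats in within it; a 5000 ms heartbeat against a 6000 ms expiry would leave almost no slack. The sketch below estimates the detection window under an assumption worth verifying in your Hadoop source: that the ResourceManager's NodeManager liveness monitor checks for expired nodes every expiry/3 ms (my reading of NMLivenessMonitor in Hadoop 2.x). LostDetectionWindow and its methods are hypothetical names, and the factor of 3 in heartbeatLeavesMargin is an arbitrary safety margin, not a Hadoop rule.

```java
// Rough timing for how soon the ResourceManager marks a silent NodeManager LOST.
// Assumption: the liveness monitor checks every expiry/3 ms (verify for your version).
class LostDetectionWindow {

    // Earliest and latest delay (ms after the last heartbeat) before the node is marked LOST.
    static long[] window(long expiryMs) {
        return new long[] {expiryMs, expiryMs + expiryMs / 3};
    }

    // True when at least three heartbeats fit into one expiry interval
    // (an arbitrary safety margin chosen for this sketch).
    static boolean heartbeatLeavesMargin(long heartbeatMs, long expiryMs) {
        return heartbeatMs * 3 <= expiryMs;
    }

    public static void main(String[] args) {
        long[] w = window(600000);  // default yarn.nm.liveness-monitor.expiry-interval-ms
        System.out.println("LOST after " + w[0] / 1000 + "-" + w[1] / 1000 + " s");
        System.out.println(heartbeatLeavesMargin(5000, 6000));
    }
}
```

With the default expiry this prints a window of 600-800 s, consistent with the 10 minutes observed in the question, and false for a 5000 ms heartbeat against a 6000 ms expiry.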


answered Oct 20 '22 04:10

Marco99
