Why does a YARN job not transition to the RUNNING state?

Tags: hadoop, hadoop-yarn, apache-samza

I've got a number of Samza jobs that I want to run. The first one runs fine. However, the second job sits in the ACCEPTED state and never transitions to the RUNNING state until I kill the first job.

Here is the view from the YARN UI:

[Screenshot: YARN UI]

Here are the details for the second job, where you can see that no node has been allocated: [screenshot]

I have 2 datanodes, so I should be able to run multiple jobs. Here is the relevant section of my yarn-site.xml (the only other settings in the file relate to the HA config, ZooKeeper, etc.):

<property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>128</value>
    <description>Minimum limit of memory to allocate to each container request at the Resource Manager.</description>
</property>
<property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>2048</value>
    <description>Maximum limit of memory to allocate to each container request at the Resource Manager.</description>
</property>
<property>
    <name>yarn.scheduler.minimum-allocation-vcores</name>
    <value>1</value>
    <description>The minimum allocation for every container request at the RM, in terms of virtual CPU cores. Requests lower than this won't take effect, and the specified value will get allocated the minimum.</description>
</property>
<property>
    <name>yarn.scheduler.maximum-allocation-vcores</name>
    <value>2</value>
    <description>The maximum allocation for every container request at the RM, in terms of virtual CPU cores. Requests higher than this won't take effect, and will get capped to this value.</description>
</property>
<property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>4096</value>
    <description>Physical memory, in MB, to be made available to running containers</description>
</property>
<property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>4</value>
    <description>Number of CPU cores that can be allocated for containers.</description>
</property>

EDIT:

I can see in the resource manager logs:

2015-11-01 17:47:37,151 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: assignedContainer application attempt=appattempt_1446300861747_0018_000001 container=Container: [ContainerId: container_1446300861747_0018_01_000002, NodeId: yarndata-01:41274, NodeHttpAddress: yarndata-01:8042, Resource: <memory:1024, vCores:1>, Priority: 0, Token: null, ] queue=default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:1024, vCores:1>, usedCapacity=0.125, absoluteUsedCapacity=0.125, numApps=1, numContainers=1 clusterResource=<memory:8192, vCores:8>
2015-11-01 17:47:37,151 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: Re-sorting assigned queue: root.default stats: default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:2048, vCores:2>, usedCapacity=0.25, absoluteUsedCapacity=0.25, numApps=1, numContainers=2
2015-11-01 17:47:37,151 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: assignedContainer queue=root usedCapacity=0.25 absoluteUsedCapacity=0.25 used=<memory:2048, vCores:2> cluster=<memory:8192, vCores:8>
2015-11-01 17:47:37,658 INFO org.apache.hadoop.yarn.server.resourcemanager.security.NMTokenSecretManagerInRM: Sending NMToken for nodeId : yarndata-01:41274 for container : container_1446300861747_0018_01_000002
2015-11-01 17:47:37,659 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1446300861747_0018_01_000002 Container Transitioned from ALLOCATED to ACQUIRED
2015-11-01 17:47:39,154 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1446300861747_0018_01_000002 Container Transitioned from ACQUIRED to RUNNING
2015-11-01 17:48:03,821 INFO org.apache.hadoop.yarn.server.resourcemanager.ClientRMService: Allocated new applicationId: 19
2015-11-01 17:48:04,339 WARN org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: The specific max attempts: 0 for application: 19 is invalid, because it is out of the range [1, 2]. Use the global max attempts instead.
2015-11-01 17:48:04,339 INFO org.apache.hadoop.yarn.server.resourcemanager.ClientRMService: Application with id 19 submitted by user www-data
2015-11-01 17:48:04,339 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=www-data IP=192.168.2.81 OPERATION=Submit Application Request    TARGET=ClientRMService  RESULT=SUCCESS  APPID=application_1446300861747_0019
2015-11-01 17:48:04,340 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Storing application with id application_1446300861747_0019
2015-11-01 17:48:04,340 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1446300861747_0019 State change from NEW to NEW_SAVING
2015-11-01 17:48:04,340 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Storing info for app: application_1446300861747_0019
2015-11-01 17:48:04,342 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1446300861747_0019 State change from NEW_SAVING to SUBMITTED
2015-11-01 17:48:04,342 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: Application added - appId: application_1446300861747_0019 user: www-data leaf-queue of parent: root #applications: 2
2015-11-01 17:48:04,342 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Accepted application application_1446300861747_0019 from user: www-data, in queue: default
2015-11-01 17:48:04,343 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1446300861747_0019 State change from SUBMITTED to ACCEPTED
2015-11-01 17:48:04,343 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Registering app attempt : appattempt_1446300861747_0019_000001
2015-11-01 17:48:04,343 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1446300861747_0019_000001 State change from NEW to SUBMITTED
2015-11-01 17:48:04,343 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: not starting application as amIfStarted exceeds amLimit
2015-11-01 17:48:04,343 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: Application added - appId: application_1446300861747_0019 user: org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue$User@202c5cd5, leaf-queue: default #user-pending-applications: 1 #user-active-applications: 1 #queue-pending-applications: 1 #queue-active-applications: 1

What am I not doing correctly please?


asked Nov 01 '15 17:11

John

1 Answer

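The answer lay in the fact that the resource manager was reporting that there were not enough resources to start a new Samza container plus the application master. The log line "not starting application as amIfStarted exceeds amLimit" shows this: the share of the queue reserved for application masters was already used up by the first job, so the second job's application master could not be started.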

I changed the value of yarn.scheduler.capacity.maximum-am-resource-percent within capacity-scheduler.xml to be more than the default of 0.1.

The documentation for this parameter states:

Maximum percent of resources in the cluster which can be used to run
application masters i.e. controls number of concurrent running applications.
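
As a rough illustration of why the default was too low here: the logs above show clusterResource=<memory:8192, vCores:8> and a first application master of 1024 MB. Assuming the limit is approximately cluster memory times the percentage, 8192 MB x 0.1 is about 819 MB, which a single 1024 MB application master already exceeds. The scheduler appears to still admit the first application, but not a second one. A minimal sketch of the override in capacity-scheduler.xml follows; the value 0.5 is only an illustrative assumption (the change above was simply to something larger than 0.1), so pick a value that suits your cluster:

```xml
<!-- capacity-scheduler.xml: sketch only; 0.5 is an assumed example value -->
<property>
    <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
    <!-- Default is 0.1. With 8192 MB of cluster memory, 0.5 would allow
         roughly 4096 MB to be used by application masters. -->
    <value>0.5</value>
    <description>Maximum percent of resources in the cluster which can be used to run application masters i.e. controls number of concurrent running applications.</description>
</property>
```

The new value takes effect only once the ResourceManager has reloaded capacity-scheduler.xml.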


answered Nov 10 '22 11:11

John
