YARN applications cannot start when specifying YARN node labels

I'm trying to use YARN node labels to tag worker nodes, but when I run applications on YARN (Spark or a simple YARN app), those applications cannot start.

  • with Spark, when specifying --conf spark.yarn.am.nodeLabelExpression="my-label", the job cannot start (blocked on Submitted application [...], see details below).

  • with a YARN application (like distributedshell), when specifying -node_label_expression my-label, the application cannot start either

Here are the tests I have run so far.

YARN node labels setup

I'm using Google Dataproc to run my cluster (example: 4 workers, 2 of them on preemptible nodes). My goal is to force any YARN application master to run on a non-preemptible node; otherwise a preemptible node can be shut down at any time, making the application fail hard.

I'm creating the cluster using YARN properties (--properties) to enable node labels:

gcloud dataproc clusters create \
    my-dataproc-cluster \
    --project [PROJECT_ID] \
    --zone [ZONE] \
    --master-machine-type n1-standard-1 \
    --master-boot-disk-size 10 \
    --num-workers 2 \
    --worker-machine-type n1-standard-1 \
    --worker-boot-disk-size 10 \
    --num-preemptible-workers 2 \
    --properties 'yarn:yarn.node-labels.enabled=true,yarn:yarn.node-labels.fs-store.root-dir=/system/yarn/node-labels'
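
As a sanity check, one can SSH into the master node and verify that the properties were applied to yarn-site.xml (Dataproc stores the Hadoop configuration under /etc/hadoop/conf):

# On the master node, check that the node-labels properties were applied
grep -A 1 'yarn.node-labels' /etc/hadoop/conf/yarn-site.xml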

Versions of packaged Hadoop and Spark:

  • Hadoop version : 2.8.2
  • Spark version : 2.2.0

After that, I create a label (my-label) and assign it to the two non-preemptible workers:

yarn rmadmin -addToClusterNodeLabels "my-label(exclusive=false)"
yarn rmadmin -replaceLabelsOnNode "\
    [WORKER_0_NAME].c.[PROJECT_ID].internal=my-label \
    [WORKER_1_NAME].c.[PROJECT_ID].internal=my-label"
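
The labels and their node assignments can also be double-checked from the command line (the node ID below is [HOSTNAME]:[NM_PORT], as listed by yarn node -list):

# List all node labels known to the ResourceManager
yarn cluster --list-node-labels

# List the nodes, then show the labels attached to a specific one
yarn node -list
yarn node -status [WORKER_0_NAME].c.[PROJECT_ID].internal:[NM_PORT]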

I can see the created label in the YARN Web UI:

[Screenshot: the my-label label listed in the YARN Web UI]

Spark

When I run a simple example (SparkPi) without specifying any node label information:

spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode client \
  /usr/lib/spark/examples/jars/spark-examples.jar \
  10

In the Scheduler tab of the YARN Web UI, I see the application launched on <DEFAULT_PARTITION>.root.default.

But when I run the job specifying spark.yarn.am.nodeLabelExpression to set the location of the Spark application master:

spark-submit \
    --class org.apache.spark.examples.SparkPi \
    --master yarn \
    --deploy-mode client \
    --conf spark.yarn.am.nodeLabelExpression="my-label" \
    /usr/lib/spark/examples/jars/spark-examples.jar \
    10

The job is not launched. In the YARN Web UI, I see:

  • YarnApplicationState: ACCEPTED: waiting for AM container to be allocated, launched and register with RM.
  • Diagnostics: Application is Activated, waiting for resources to be assigned for AM. Details : AM Partition = my-label ; Partition Resource = <memory:6144, vCores:2> ; Queue's Absolute capacity = 0.0 % ; Queue's Absolute used capacity = 0.0 % ; Queue's Absolute max capacity = 0.0 % ;

I suspect that the queue related to the label partition (not <DEFAULT_PARTITION>, the other one) does not have sufficient resources to run the job:

[Screenshot: the Spark job stuck in the ACCEPTED state in the Scheduler tab]

Here, Used Application Master Resources is <memory:1024, vCores:1>, but Max Application Master Resources is <memory:0, vCores:0>. Since a queue's AM resource limit on a partition is derived from the queue's configured capacity on that partition, and the diagnostics show the default queue's absolute capacity on my-label is 0.0 %, that explains why the application cannot start; but I can't figure out how to change this.
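
Side note: the default queue's state, capacities, and accessible node labels can also be inspected from the CLI, which is a quick way to compare the two partitions (the exact fields printed vary across Hadoop versions):

# Print the default queue's state, capacities and accessible node labels
yarn queue -status default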

I tried updating different parameters, without success:

yarn.scheduler.capacity.root.default.accessible-node-labels=my-label

Or increasing these properties:

yarn.scheduler.capacity.root.default.accessible-node-labels.my-label.capacity
yarn.scheduler.capacity.root.default.accessible-node-labels.my-label.maximum-capacity
yarn.scheduler.capacity.root.default.accessible-node-labels.my-label.maximum-am-resource-percent
yarn.scheduler.capacity.root.default.accessible-node-labels.my-label.user-limit-factor
yarn.scheduler.capacity.root.default.accessible-node-labels.my-label.minimum-user-limit-percent

but again without success.
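
For completeness, a typical way to apply such changes on Dataproc (the paths below are the Dataproc defaults) is to edit the scheduler configuration on the master node and ask the ResourceManager to reload it:

# Edit the scheduler configuration on the master node
sudo vi /etc/hadoop/conf/capacity-scheduler.xml

# Reload the queue configuration without restarting the ResourceManager
yarn rmadmin -refreshQueues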

YARN Application

The issue is the same when running a YARN application:

hadoop jar \
    /usr/lib/hadoop-yarn/hadoop-yarn-applications-distributedshell.jar \
    -shell_command "echo ok" \
    -jar /usr/lib/hadoop-yarn/hadoop-yarn-applications-distributedshell.jar \
    -queue default \
    -node_label_expression my-label

The application cannot start, and the logs keep repeating:

INFO distributedshell.Client: Got application report from ASM for, appId=6, clientToAMToken=null, appDiagnostics= Application is Activated, waiting for resources to be assigned for AM. Details : AM Partition = my-label ; Partition Resource = <memory:6144, vCores:2> ; Queue's Absolute capacity = 0.0 % ; Queue's Absolute used capacity = 0.0 % ; Queue's Absolute max capacity = 0.0 % ; , appMasterHost=N/A, appQueue=default, appMasterRpcPort=-1, appStartTime=1520354045946, yarnAppState=ACCEPTED, distributedFinalState=UNDEFINED, [...]

If I don't specify -node_label_expression my-label, the application starts on <DEFAULT_PARTITION>.root.default and succeeds.

Questions

  • Am I doing something wrong with the labels? I followed the official documentation and this guide
  • Is this a problem specific to Dataproc? The guides above seem to work in other environments
  • Do I need to create a specific queue and associate it with my label? Since I'm running a "one-shot" cluster for a single Spark job, I don't need dedicated queues; running jobs on the default root queue is not a problem for my use case

Thanks for your help.

asked Mar 07 '18 by norbjd



1 Answer

A Google engineer answered us (on a private issue we raised, not in the public issue tracker) and gave us a solution: specifying an initialization script at Dataproc cluster creation. I don't think the issue comes from Dataproc; this is basically just YARN configuration. The script sets the following properties in capacity-scheduler.xml, just after creating the node label (my-label):

<property>
  <name>yarn.scheduler.capacity.root.accessible-node-labels</name>
  <value>my-label</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.accessible-node-labels.my-label.capacity</name>
  <value>100</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.default.accessible-node-labels</name>
  <value>my-label</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.default.accessible-node-labels.my-label.capacity</name>
  <value>100</value>
</property>

According to the comment accompanying the script, this "set accessible-node-labels on both root (the root queue) and root.default (the default queue applications actually get run on)". The root.default part is what was missing in my tests. Capacity for both is set to 100.

Then, the YARN ResourceManager must be restarted (systemctl restart hadoop-yarn-resourcemanager.service) for the modifications to take effect.
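
For reference, an initialization action reproducing this fix could look like the sketch below. This is my own reconstruction, not the actual script Google provided; it assumes the bdconfig utility and the get_metadata_value helper shipped on Dataproc images, and that the node label itself is created separately (yarn rmadmin -addToClusterNodeLabels, as in the question):

#!/bin/bash
# Dataproc initialization action (sketch): make the my-label partition
# accessible to the root and root.default queues, then restart the RM.
set -euxo pipefail

ROLE=$(/usr/share/google/get_metadata_value attributes/dataproc-role)
if [[ "${ROLE}" == 'Master' ]]; then
  CONF=/etc/hadoop/conf/capacity-scheduler.xml
  for prop in \
      'yarn.scheduler.capacity.root.accessible-node-labels=my-label' \
      'yarn.scheduler.capacity.root.accessible-node-labels.my-label.capacity=100' \
      'yarn.scheduler.capacity.root.default.accessible-node-labels=my-label' \
      'yarn.scheduler.capacity.root.default.accessible-node-labels.my-label.capacity=100'; do
    # Write each property into capacity-scheduler.xml (overwrite if present)
    bdconfig set_property \
        --configuration_file "${CONF}" \
        --name "${prop%%=*}" \
        --value "${prop#*=}" \
        --clobber
  done
  systemctl restart hadoop-yarn-resourcemanager.service
fi

Such a script would be passed at cluster creation with the --initialization-actions flag of gcloud dataproc clusters create.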

After that, I was able to start the jobs that could not start in my question.

Hope this helps anyone hitting the same or similar issues.

answered Oct 24 '22 by norbjd