Add Yarn cluster configuration to Spark application

I'm trying to use Spark on YARN from a Scala sbt application instead of using spark-submit directly.

I already have a remote YARN cluster running, and I can connect to it and run Spark jobs from SparkR. But when I try to do the same thing from a Scala application, it doesn't pick up my environment variables for the YARN configuration and instead uses the default YARN address and port.

The sbt application is just a simple object:

import org.apache.spark.{SparkConf, SparkContext}

object simpleSparkApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("simpleSparkApp")
      .setMaster("yarn-client")
      .set("SPARK_HOME", "/opt/spark-1.5.1-bin-hadoop2.6")
      .set("HADOOP_HOME", "/opt/hadoop-2.6.0")
      .set("HADOOP_CONF_DIR", "/opt/hadoop-2.6.0/etc/hadoop")
    val sc = new SparkContext(conf)
  }
}

When I run this application in IntelliJ IDEA, the log says:

15/11/15 18:46:05 INFO RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
15/11/15 18:46:06 INFO Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
15/11/15 18:46:07 INFO Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
...

It seems the environment is not picked up correctly, because 0.0.0.0 is not the IP of the remote YARN ResourceManager node, and my spark-env.sh has:

export JAVA_HOME="/usr/lib/jvm/ibm-java-x86_64-80"
export HADOOP_HOME="/opt/hadoop-2.6.0"
export HADOOP_CONF_DIR="$HADOOP_HOME/etc/hadoop"
export SPARK_MASTER_IP="master"

and my yarn-site.xml has:

<property>
    <name>yarn.resourcemanager.hostname</name>
    <value>master</value>
</property>

How can I correctly pass the environment variables of the YARN cluster configuration to this sbt Spark application?

Extra information:

My system is Ubuntu 14.04, and the SparkR code that can connect to the YARN cluster looks like this:

Sys.setenv(HADOOP_HOME = "/opt/hadoop-2.6.0")
Sys.setenv(SPARK_HOME = "/opt/spark-1.4.1-bin-hadoop2.6")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
sc <- sparkR.init(master = "yarn-client")
asked Nov 16 '15 by Bamqf


1 Answer

These days, there is no out-of-the-box way to avoid using spark-submit for YARN mode.

spark-submit: to run the job, spark-submit runs the org.apache.spark.deploy.yarn.Client code in the configured environment (or, as in your case, an unconfigured one). Here is the Client that does the job submission: https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
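
That Client builds its Hadoop/YARN configuration from whatever the JVM can see, so in embedded yarn-client mode it only finds your ResourceManager if HADOOP_CONF_DIR (or YARN_CONF_DIR) is a real environment variable of the JVM and the directory containing yarn-site.xml is on the runtime classpath, not a SparkConf entry. As a fallback, Spark copies spark.hadoop.* properties into the Hadoop Configuration, so the ResourceManager address can usually be forced programmatically. A minimal sketch under those assumptions (hostname master and default port 8032 taken from your yarn-site.xml; this is a workaround, not an official turnkey API):

import org.apache.spark.{SparkConf, SparkContext}

object SimpleSparkAppOnYarn {
  def main(args: Array[String]): Unit = {
    // Assumption: HADOOP_CONF_DIR / YARN_CONF_DIR are set as real environment
    // variables of this JVM (e.g. in the IntelliJ run configuration) and
    // /opt/hadoop-2.6.0/etc/hadoop is on the runtime classpath so that
    // yarn-site.xml is actually loaded.
    val conf = new SparkConf()
      .setAppName("simpleSparkApp")
      .setMaster("yarn-client")
      // Fallback: spark.hadoop.* entries are copied into the Hadoop
      // Configuration, so the ResourceManager address can be forced here
      // instead of relying on yarn-site.xml being found.
      .set("spark.hadoop.yarn.resourcemanager.hostname", "master")
      .set("spark.hadoop.yarn.resourcemanager.address", "master:8032")

    val sc = new SparkContext(conf)
    // ... job code ...
    sc.stop()
  }
}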

What is the solution, then?

  1. There used to be an option to override the Client behavior, as described at http://blog.sequenceiq.com/blog/2014/08/22/spark-submit-in-java/, so that you could add extra environment variables, etc. Later (around the end of 2014) Spark made the YARN Client private to the spark package, so putting your own code in an org.apache.spark package is possibly an option (a rough sketch follows this list).

  2. A solution built on top of spark-submit (with its advantages and drawbacks) is described here: http://www.henningpetersen.com/post/22/running-apache-spark-jobs-from-applications

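For option 1, here is a rough sketch of what direct submission through the YARN Client looked like in Spark 1.x, adapted from the SequenceIQ post above. Since Client and ClientArguments are private[spark] in recent 1.x releases, the file has to live in an org.apache.spark.* package; the jar path, class name and argument below are placeholders, and the exact flags and constructors vary between versions, so treat it as an illustration rather than a supported API:

// Lives in an org.apache.spark.* subpackage only because the YARN Client is
// private[spark]; this mirrors the linked SequenceIQ example, not a public API.
package org.apache.spark.deploy.yarn.demo

import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkConf
import org.apache.spark.deploy.yarn.{Client, ClientArguments}

object DirectYarnSubmit {
  def main(args: Array[String]): Unit = {
    // Tell Spark it is running against YARN without going through spark-submit.
    System.setProperty("SPARK_YARN_MODE", "true")

    val sparkConf = new SparkConf()
    // Picks up yarn-site.xml only if the Hadoop conf directory is on the classpath.
    val hadoopConf = new Configuration()

    // The same kind of arguments spark-submit would hand to the YARN Client;
    // jar path, class name and argument are placeholders.
    val clientArgs = new ClientArguments(Array(
      "--jar",   "/path/to/your-app-assembly.jar",
      "--class", "simpleSparkApp",
      "--arg",   "someArgument"
    ), sparkConf)

    new Client(clientArgs, hadoopConf, sparkConf).run()
  }
}
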
As for SparkR, sparkR.R uses spark-submit internally: https://github.com/apache/spark/blob/master/R/pkg/R/sparkR.R calls launchBackend() from https://github.com/apache/spark/blob/master/R/pkg/R/client.R and passes along all of the environment that is already set, plus its arguments.
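
If driving spark-submit from Scala in the same way is acceptable, one option (not mentioned in the question, so treat it as a suggestion) is the spark-launcher module (org.apache.spark:spark-launcher), which is a thin programmatic wrapper around spark-submit and lets you hand HADOOP_CONF_DIR to the child process explicitly. A minimal sketch, with placeholder jar and class names:

import scala.collection.JavaConverters._
import org.apache.spark.launcher.SparkLauncher

object LaunchViaSparkSubmit {
  def main(args: Array[String]): Unit = {
    // Environment for the child spark-submit process; this is where
    // HADOOP_CONF_DIR belongs so the YARN client can locate yarn-site.xml.
    val env = Map(
      "HADOOP_CONF_DIR" -> "/opt/hadoop-2.6.0/etc/hadoop",
      "YARN_CONF_DIR"   -> "/opt/hadoop-2.6.0/etc/hadoop"
    ).asJava

    val sparkSubmit = new SparkLauncher(env)
      .setSparkHome("/opt/spark-1.5.1-bin-hadoop2.6")
      .setAppResource("/path/to/your-app-assembly.jar") // placeholder jar
      .setMainClass("simpleSparkApp")                   // placeholder class
      .setMaster("yarn-client")
      .launch()                                         // spawns spark-submit

    sparkSubmit.waitFor()                               // wait for the job to finish
  }
}

The drawback is the same as in point 2 above: you are still going through spark-submit, just without shelling out by hand.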

answered by Elena Viter