Is it possible to run Spark on a YARN cluster from code?

I have a MapReduce task that I want to run on a Spark YARN cluster from my Java code. I also want to retrieve the reduce result (a string and number pair, i.e. a tuple) in my Java code. Something like:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// I know setMaster("YARN") is wrong, but it's just to describe what I want:
// I want to execute the job on the cluster.
SparkConf sparkConf = new SparkConf().setAppName("Test").setMaster("YARN");
JavaSparkContext sc = new JavaSparkContext(sparkConf);

JavaRDD<Integer> input = sc.parallelize(list);

// map: MapToPairExample is my PairFunction<Integer, String, Integer>
JavaPairRDD<String, Integer> results = input.mapToPair(new MapToPairExample());

// reduce: MyResultsComparator is my Comparator over the (String, Integer) pairs
String max = results.max(new MyResultsComparator())._1();

It works if I set the master to local, local[*] or spark://master:7707.

So the question is: can I do the same with a YARN cluster somehow?

asked Feb 20 '16 by pikkvile

2 Answers

Typically, a spark-submit command works the following way when the master is yarn and the deploy mode is cluster (source: the Spark code base on GitHub):

  1. The spark-submit script calls Main.java.
  2. Main.java calls SparkSubmit.java.
  3. SparkSubmit.java resolves the master and deploy-mode parameters and calls YarnClusterApplication.
  4. YarnClusterApplication calls Client.java.
  5. Client.java talks to the Resource Manager and submits the ApplicationMaster launch context.
  6. The Resource Manager instantiates ApplicationMaster.java in a container on a Node Manager.
  7. ApplicationMaster.java:
    1. allocates containers for executors using ExecutorRunnable instances
    2. uses the reflection API to find the main method in the user-supplied jar
    3. spawns a thread that executes the user application by invoking that main method from Step 7.2. This is where your code executes.

In this flow, Steps 1-5 happen on the client/gateway machine. Starting from Step 6, everything executes on the Yarn cluster.
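
To make Step 7.3 concrete, here is a minimal sketch of what the user-supplied main class could look like in this flow. Note that it never calls setMaster: spark-submit supplies the master and deploy mode, and this main method already runs inside the ApplicationMaster container. The class name is an illustrative assumption, not something from the question.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class YarnClusterJob {
    public static void main(String[] args) {
        // No setMaster here: in yarn-cluster mode the master and
        // deploy mode come from the spark-submit command line, and
        // this method executes inside the ApplicationMaster (Step 7.3).
        SparkConf conf = new SparkConf().setAppName("Test");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... build RDDs and run the map/reduce from the question ...
        sc.stop();
    }
}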

Now, to answer your question: I have never tried executing Spark in yarn-cluster mode from code, but based on the flow above, your piece of code can only run inside an ApplicationMaster container on a Node Manager machine of the YARN cluster if you want yarn-cluster mode. And your code can only get there if you submit it with spark-submit --master yarn --deploy-mode cluster from the command line. So specifying the master in code and:

  1. running the job, e.g. from the IDE, will fail.
  2. running the job using spark-submit --master yarn --deploy-mode cluster will execute your code in a thread in the ApplicationMaster, which runs on a Node Manager machine in the YARN cluster. That thread will ultimately re-execute your setMaster("yarn-cluster") line, which is by then redundant, but the rest of your code will run successfully (see the sketch after this list).
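
One way to keep a single code path that works both from the IDE and under spark-submit is to set a master only when none has been supplied. A minimal sketch, where the local[*] fallback is my assumption for IDE runs:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf().setAppName("Test");
// spark-submit --master yarn --deploy-mode cluster sets spark.master for us;
// fall back to a local master only when it is absent, e.g. in IDE runs.
conf.setIfMissing("spark.master", "local[*]");
JavaSparkContext sc = new JavaSparkContext(conf);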

Any corrections to this are welcome!

answered Oct 23 '22 by Sheel Pancholi


You need to do it using spark-submit. spark-submit handles many things for you, from shipping dependencies to the cluster to setting the correct classpath. When you run the application as a plain Java program in local mode, your IDE takes care of the classpath (since the driver and executors run in the same JVM).
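
If you want to kick off that spark-submit from Java code rather than from a shell, Spark also ships a programmatic wrapper around it in the spark-launcher module. A minimal sketch; the jar path and main class below are placeholders for your own application:

import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;

public class SubmitToYarn {
    public static void main(String[] args) throws Exception {
        // Equivalent to spark-submit --master yarn --deploy-mode cluster ...
        SparkAppHandle handle = new SparkLauncher()
                .setAppResource("/path/to/your-app.jar")   // placeholder path
                .setMainClass("com.example.YourSparkJob")  // placeholder class
                .setMaster("yarn")
                .setDeployMode("cluster")
                .startApplication();

        // Poll until YARN reports a terminal state for the application.
        while (!handle.getState().isFinal()) {
            Thread.sleep(1000);
        }
        System.out.println("Final state: " + handle.getState());
    }
}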

You can also use "yarn-client" mode if you want your driver program to run on your machine.

For yarn-cluster mode, use .setMaster("yarn-cluster").
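
For reference, a minimal sketch of both variants. As a side note, the "yarn-client" and "yarn-cluster" master strings are the Spark 1.x style; Spark 2.x replaced them with a plain "yarn" master plus a separate deploy mode:

import org.apache.spark.SparkConf;

// Spark 1.x style masters, as used in this answer:
SparkConf clientMode  = new SparkConf().setAppName("Test").setMaster("yarn-client");
SparkConf clusterMode = new SparkConf().setAppName("Test").setMaster("yarn-cluster");

// Spark 2.x style: master "yarn" plus a deploy mode property.
SparkConf yarn2 = new SparkConf()
        .setAppName("Test")
        .setMaster("yarn")
        .set("spark.submit.deployMode", "cluster"); // or "client"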

answered Oct 23 '22 by Pankaj Arora