Is it possible to run Spark on a YARN cluster from code?

I have a MapReduce task that I want to run on a Spark YARN cluster from my Java code. I also want to retrieve the reduce result (a string and number pair, i.e. a tuple) in my Java code. Something like:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// I know setMaster("YARN") is wrong, but it's just to describe what I want:
// I want to execute the job on the cluster.
SparkConf sparkConf = new SparkConf().setAppName("Test").setMaster("YARN");
JavaSparkContext sc = new JavaSparkContext(sparkConf);

JavaRDD<Integer> input = sc.parallelize(list);

// map: MapToPairExample is my PairFunction<Integer, String, Integer>
JavaPairRDD<String, Integer> results = input.mapToPair(new MapToPairExample());

// reduce: MyResultsComparator is my Comparator over the (String, Integer) pairs
String max = results.max(new MyResultsComparator())._1();

It works if I set the master to local, local[*] or spark://master:7707.

So the question is: can I do the same with a YARN cluster somehow?

asked Feb 20 '16 by pikkvile

2 Answers

Typically, a spark-submit command works the following way when the master is yarn and the deploy mode is cluster (source: the Spark code base on GitHub):

  1. The spark-submit script calls Main.java.
  2. Main.java calls SparkSubmit.java.
  3. SparkSubmit.java resolves the master and deploy-mode parameters and calls YarnClusterApplication.
  4. YarnClusterApplication calls Client.java.
  5. Client.java talks to the Resource Manager and submits the ApplicationMaster launch context.
  6. The Resource Manager instantiates ApplicationMaster.java in a container on a Node Manager.
  7. ApplicationMaster.java:
    1. allocates containers for executors using ExecutorRunnable instances
    2. uses the reflection API to find the main method in the user-supplied jar
    3. spawns a thread that executes the user application by invoking that main method from Step 7.2. This is where your code executes.

In this flow, Steps 1-5 happen on the client/gateway machine. Starting from Step 6, everything executes on the Yarn cluster.
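
To make Step 7.3 concrete, here is a minimal sketch of what the user-supplied main class could look like in this flow. Note that it never calls setMaster: spark-submit supplies the master and deploy mode, and this main method already runs inside the ApplicationMaster container. The class name is an illustrative assumption, not something from the question.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class YarnClusterJob {
    public static void main(String[] args) {
        // No setMaster here: in yarn-cluster mode the master and
        // deploy mode come from the spark-submit command line, and
        // this method executes inside the ApplicationMaster (Step 7.3).
        SparkConf conf = new SparkConf().setAppName("Test");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... build RDDs and run the map/reduce from the question ...
        sc.stop();
    }
}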

Now, to answer your question: I have never tried executing Spark in yarn-cluster mode from code, but based on the flow above, your piece of code can only run inside an ApplicationMaster container on a Node Manager machine of the YARN cluster if you want yarn-cluster mode. And your code can only get there if you submit it with spark-submit --master yarn --deploy-mode cluster from the command line. So specifying the master in code and:

  1. running the job, e.g. from the IDE, will fail.
  2. running the job using spark-submit --master yarn --deploy-mode cluster will execute your code in a thread in the ApplicationMaster, which runs on a Node Manager machine in the YARN cluster. That thread will ultimately re-execute your setMaster("yarn-cluster") line, which is by then redundant, but the rest of your code will run successfully (see the sketch after this list).
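
One way to keep a single code path that works both from the IDE and under spark-submit is to set a master only when none has been supplied. A minimal sketch, where the local[*] fallback is my assumption for IDE runs:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf().setAppName("Test");
// spark-submit --master yarn --deploy-mode cluster sets spark.master for us;
// fall back to a local master only when it is absent, e.g. in IDE runs.
conf.setIfMissing("spark.master", "local[*]");
JavaSparkContext sc = new JavaSparkContext(conf);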

Any corrections to this are welcome!

answered Oct 23 '22 by Sheel Pancholi


You need to do it using spark-submit. spark-submit handles many things for you, from shipping dependencies to the cluster to setting the correct classpath. When you run the application as a plain Java program in local mode, your IDE takes care of the classpath (since the driver and executors run in the same JVM).
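
If you want to kick off that spark-submit from Java code rather than from a shell, Spark also ships a programmatic wrapper around it in the spark-launcher module. A minimal sketch; the jar path and main class below are placeholders for your own application:

import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;

public class SubmitToYarn {
    public static void main(String[] args) throws Exception {
        // Equivalent to spark-submit --master yarn --deploy-mode cluster ...
        SparkAppHandle handle = new SparkLauncher()
                .setAppResource("/path/to/your-app.jar")   // placeholder path
                .setMainClass("com.example.YourSparkJob")  // placeholder class
                .setMaster("yarn")
                .setDeployMode("cluster")
                .startApplication();

        // Poll until YARN reports a terminal state for the application.
        while (!handle.getState().isFinal()) {
            Thread.sleep(1000);
        }
        System.out.println("Final state: " + handle.getState());
    }
}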

You can also use "yarn-client" mode if you want your driver program to run on your machine.

For yarn-cluster mode, use .setMaster("yarn-cluster").
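
For reference, a minimal sketch of both variants. As a side note, the "yarn-client" and "yarn-cluster" master strings are the Spark 1.x style; Spark 2.x replaced them with a plain "yarn" master plus a separate deploy mode:

import org.apache.spark.SparkConf;

// Spark 1.x style masters, as used in this answer:
SparkConf clientMode  = new SparkConf().setAppName("Test").setMaster("yarn-client");
SparkConf clusterMode = new SparkConf().setAppName("Test").setMaster("yarn-cluster");

// Spark 2.x style: master "yarn" plus a deploy mode property.
SparkConf yarn2 = new SparkConf()
        .setAppName("Test")
        .setMaster("yarn")
        .set("spark.submit.deployMode", "cluster"); // or "client"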

answered Oct 23 '22 by Pankaj Arora