How to use the programmatic spark submit capability

There is a somewhat recent (Spring 2015) feature apparently intended to allow submitting a Spark job programmatically.

Here is the JIRA https://issues.apache.org/jira/browse/SPARK-4924

However, there is uncertainty (count me among the uncertain) about how to actually use these features. Here are the last comments in the JIRA:

[screenshot of the closing JIRA comments]

When the actual author of this work was asked to explain further, the answer was to "look in the API docs".

The "user document" is the Spark API documentation.

The author did not provide further details and apparently feels the whole issue is self-explanatory. If anyone can connect the dots here - specifically, where in the API docs this newer Spark Submit capability is described - it would be appreciated.

Here is some of the info I am looking for - pointers to the following:

  • What capabilities have been added to the Spark API
  • How do we use them
  • Any examples / other relevant documentation and/or code

Update: The SparkLauncher referred to in the accepted answer does launch a simple app under trivial (master=local[*]) conditions. It remains to be seen how usable it will be on an actual cluster. After adding a print statement to the linked code:

println("launched.. and waiting..") spark.waitFor()

We do see:

launched.. and waiting..

Well, this is probably a small step forward. I will update this question as I move towards a real clustered environment. A hedged sketch of what a cluster submission might look like follows.
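For reference, here is a sketch of what a clustered submission might look like with the same SparkLauncher builder; the YARN master, deploy mode, paths, and memory setting below are illustrative assumptions, not tested settings:

import org.apache.spark.launcher.SparkLauncher

object ClusterLauncher extends App {
  // All values here are hypothetical - adjust for your cluster
  val process = new SparkLauncher()
    .setSparkHome("/opt/spark")
    .setAppResource("hdfs:///apps/example-assembly-1.0.jar")
    .setMainClass("MySparkApp")
    .setMaster("yarn")
    .setDeployMode("cluster")
    .setConf(SparkLauncher.EXECUTOR_MEMORY, "2g")
    .launch()
  process.waitFor()
}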

asked May 15 '16 by WestCoastProjects


People also ask

What happens when we use spark submit?

When you run spark-submit, a driver program is launched. The driver requests resources from the cluster manager, and at the same time it initiates the main program of the user's processing code.

How do I submit a python code in spark submit?

The Apache Spark binary comes with a spark-submit.sh script for Linux and Mac, and a spark-submit.cmd command file for Windows. These scripts live in the $SPARK_HOME/bin directory and are used to submit a PySpark file (a .py file, i.e. Spark with Python) to the cluster.
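For example, a minimal invocation might look like the following (the file name and master URL are illustrative assumptions):

$SPARK_HOME/bin/spark-submit --master local[*] my_pyspark_app.py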

How do I run spark submit in client mode?

You can submit a Spark batch application in cluster mode (the default) or client mode, either from inside the cluster or from an external client. In cluster mode the driver runs on a host in your driver resource group, and the spark-submit syntax is --deploy-mode cluster. In client mode the driver runs on the machine that submits the application, and the syntax is --deploy-mode client.
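A minimal client-mode submission might look like this (the master URL and jar name are illustrative assumptions):

$SPARK_HOME/bin/spark-submit --master yarn --deploy-mode client my-app.jar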


1 Answer

Looking at the details of the pull request, it seems that the functionality is provided by the SparkLauncher class, described in the API docs here.

public class SparkLauncher extends Object

Launcher for Spark applications.

Use this class to start Spark applications programmatically. The class uses a builder pattern to allow clients to configure the Spark application and launch it as a child process.

The API docs are rather minimal, but I found a blog post that gives a worked example (code also available in a GitHub repo). I have copied a simplified version of the example below (untested) in case the links go stale:

import org.apache.spark.launcher.SparkLauncher

object Launcher extends App {
  val spark = new SparkLauncher()
    .setSparkHome("/home/user/spark-1.4.0-bin-hadoop2.6")
    .setAppResource("/home/user/example-assembly-1.0.jar")
    .setMainClass("MySparkApp")
    .setMaster("local[*]")
    .launch();
  spark.waitFor();
}
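Note that launch() returns a plain java.lang.Process, so the launched application's output is not echoed automatically; you have to read it from the child process yourself. Here is a minimal, untested sketch of doing that (the stream handling is an assumption about what you want to do with the output):

import org.apache.spark.launcher.SparkLauncher
import scala.io.Source

object LauncherWithOutput extends App {
  val process = new SparkLauncher()
    .setSparkHome("/home/user/spark-1.4.0-bin-hadoop2.6")
    .setAppResource("/home/user/example-assembly-1.0.jar")
    .setMainClass("MySparkApp")
    .setMaster("local[*]")
    .launch()

  // spark-submit writes its log output to stderr; echo it so the launch is visible
  Source.fromInputStream(process.getErrorStream).getLines().foreach(println)
  process.waitFor()
}

From Spark 1.6 onwards there is also SparkLauncher.startApplication(...), which returns a SparkAppHandle for monitoring and controlling the application instead of a raw Process.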

See also:

  • Another tutorial blog post / review of the feature
  • A book chapter on the topic
answered Sep 17 '22 by DNA