There is a relatively recent (Spring 2015) feature apparently intended to allow submitting a Spark job programmatically.
Here is the JIRA https://issues.apache.org/jira/browse/SPARK-4924
However, there is uncertainty (count me among the uncertain) about how to actually use this feature. Here are the last comments in the JIRA:
When the actual author of this work is asked to explain further, the answer is "look in the API docs".
The "user document" is the Spark API documentation.
The author did not provide further details and apparently feels the whole issue is self-explanatory. If anyone can connect the dots here - specifically, where in the API docs this newer spark-submit capability is described - it would be appreciated.
Here is some of the info I am looking for - pointers to the following:
Update
The SparkLauncher referred to in the accepted answer does launch a simple app under trivial (master = local[*]) conditions. It remains to be seen how usable it will be on an actual cluster. After adding a print statement to the linked code:
println("launched.. and waiting..")
spark.waitFor()
We do see:
launched.. and waiting..
Well, this is probably a small step forward. I will update this question as I move towards a real clustered environment.
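One thing that may help when experimenting: launch() returns a plain java.lang.Process whose stdout/stderr are not inherited by the parent JVM, so the child spark-submit's logging is easy to miss. An untested variation of the linked code that echoes the child's error stream (the paths below are placeholders, and this assumes Spark's default log4j configuration, which typically writes to stderr):

import scala.io.Source
import org.apache.spark.launcher.SparkLauncher

object LauncherWithOutput extends App {
  val spark = new SparkLauncher()
    .setSparkHome("/path/to/spark")          // placeholder SPARK_HOME
    .setAppResource("/path/to/my-app.jar")   // placeholder application jar
    .setMainClass("MySparkApp")
    .setMaster("local[*]")
    .launch()

  // Echo the child process's stderr (where Spark's logs usually go) to our console
  val logger = new Thread(new Runnable {
    override def run(): Unit =
      Source.fromInputStream(spark.getErrorStream).getLines().foreach(println)
  })
  logger.setDaemon(true)
  logger.start()

  println("launched.. and waiting..")
  spark.waitFor()
}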
Once you run spark-submit, a driver program is launched; it requests resources from the cluster manager, and at the same time the main program of the user's processing code is started by the driver program.
Submitting a Python file: the Apache Spark binary distribution comes with a spark-submit shell script for Linux and macOS and a spark-submit.cmd command file for Windows. These scripts live in the $SPARK_HOME/bin directory and are used to submit a PySpark file (a .py script, i.e. Spark with Python) to the cluster.
You can submit a Spark batch application in cluster mode (the default) or client mode, either from inside the cluster or from an external client. In cluster mode, the driver runs on a host in your driver resource group; the spark-submit syntax is --deploy-mode cluster.
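For the programmatic route this question is about, those same knobs map onto SparkLauncher's builder methods. A rough, untested sketch (the SPARK_HOME, script path, and master URL are placeholders, and I'm assuming setDeployMode takes the same values as the --deploy-mode flag):

import org.apache.spark.launcher.SparkLauncher

object PySparkLauncher extends App {
  // Roughly equivalent to: spark-submit --master <url> --deploy-mode cluster my_job.py
  val process = new SparkLauncher()
    .setSparkHome("/opt/spark")               // placeholder SPARK_HOME
    .setAppResource("/path/to/my_job.py")     // a .py file instead of a jar; no main class needed
    .setMaster("spark://master-host:7077")    // placeholder cluster master URL
    .setDeployMode("cluster")                 // or "client"
    .launch()

  process.waitFor()
}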
Looking at the details of the pull request, it seems that the functionality is provided by the SparkLauncher
class, described in the API docs here.
public class SparkLauncher extends Object
Launcher for Spark applications.
Use this class to start Spark applications programmatically. The class uses a builder pattern to allow clients to configure the Spark application and launch it as a child process.
The API docs are rather minimal, but I found a blog post that gives a worked example (code also available in a GitHub repo). I have copied a simplified version of the example below (untested) in case the links go stale:
import org.apache.spark.launcher.SparkLauncher

object Launcher extends App {
  // Configure the Spark application and launch it as a child process
  val spark = new SparkLauncher()
    .setSparkHome("/home/user/spark-1.4.0-bin-hadoop2.6")
    .setAppResource("/home/user/example-assembly-1.0.jar")
    .setMainClass("MySparkApp")
    .setMaster("local[*]")
    .launch()

  // Block until the child spark-submit process exits
  spark.waitFor()
}
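The example above just blocks on the child process. If you need to track the application on a real cluster, newer Spark versions (1.6+, so not the 1.4.0 shown above) add SparkLauncher.startApplication, which returns a SparkAppHandle reporting state transitions and the application ID. A sketch of how that might look (untested; the master URL, deploy mode, and paths are placeholders):

import java.util.concurrent.CountDownLatch
import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

object MonitoredLauncher extends App {
  val finished = new CountDownLatch(1)

  // startApplication (Spark 1.6+) returns a handle rather than a raw Process
  val handle = new SparkLauncher()
    .setSparkHome("/opt/spark")                       // placeholder SPARK_HOME
    .setAppResource("/path/to/example-assembly.jar")  // placeholder jar
    .setMainClass("MySparkApp")
    .setMaster("yarn")                                // placeholder cluster master
    .setDeployMode("cluster")
    .startApplication(new SparkAppHandle.Listener {
      // called on every state transition (SUBMITTED, RUNNING, FINISHED, FAILED, ...)
      override def stateChanged(h: SparkAppHandle): Unit = {
        println(s"state=${h.getState} appId=${h.getAppId}")
        if (h.getState.isFinal) finished.countDown()
      }
      override def infoChanged(h: SparkAppHandle): Unit = ()
    })

  finished.await() // block until the application reaches a terminal state
}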