I want to expose my Spark applications to users through a web application.
Basically, the user can decide which action he wants to run and enter a few variables, which need to be passed to the Spark application. For example: the user enters a few fields and then clicks a button that does the following: "run sparkApp1 with parameters min_x, max_x, min_y, max_y".
The Spark application should be launched with the parameters given by the user. After it finishes, the web application might need to retrieve the results (from HDFS or MongoDB) and display them to the user. While the job is processing, the web application should display the status of the Spark application.
I'm running a cluster of Spark 1.6.1 with YARN/Mesos (not sure yet) and MongoDB.
My question: what is a good way to launch such Spark applications with user-supplied parameters from a web application, and to monitor their status?
Basically you can use the SparkLauncher class to launch Spark applications and add some listeners to watch their progress.
However, you may be interested in Livy server, which is a RESTful server for Spark jobs. As far as I know, Zeppelin uses Livy to submit jobs and retrieve their status.
You can also use the Spark REST interface to check state; the information will then be more precise. Here is an example of how to submit a job via the REST API.
You've got a few options, and the answer is: check by yourself ;) It depends very much on your project and requirements. Both of the two main options should be good for you; you just have to check which is easier and better to use in your project.
You can use Spark from your application in different ways, depending on what you need and what you prefer.
SparkLauncher is a class from the spark-launcher artifact. It is used to launch already prepared Spark jobs in the same way as spark-submit.
Typical usage is:
1) Build a project with your Spark job and copy the JAR file to all nodes.
2) From your client application, i.e. the web application, create a SparkLauncher which points to the prepared JAR file:
SparkAppHandle handle = new SparkLauncher()
    .setSparkHome(SPARK_HOME)
    .setJavaHome(JAVA_HOME)
    .setAppResource(pathToJARFile)
    .setMainClass(MainClassFromJarWithJob)
    .setMaster("MasterAddress")
    .startApplication();
    // or: .launch().waitFor()
startApplication creates a SparkAppHandle, which allows you to add listeners and stop the application. It also provides the possibility to call getAppId.
SparkLauncher should be used together with the Spark REST API. You can query http://driverNode:4040/api/v1/applications/*ResultFromGetAppId*/jobs and you will get information about the current status of the application.
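The listener mechanism mentioned above is how the web application can get status updates pushed to it. Below is a minimal sketch of attaching a listener; the Spark home, JAR path, main class and master are placeholders, not values from the answer above:

// Minimal sketch: launch a prepared job and react to state changes.
// The Spark home, JAR path, main class and master below are placeholders.
import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;

public class LauncherWithListener {
    public static void main(String[] args) throws Exception {
        SparkAppHandle handle = new SparkLauncher()
                .setSparkHome("/opt/spark")                    // placeholder
                .setAppResource("/path/to/spark-job-1.0.jar")  // placeholder
                .setMainClass("com.example.SparkJobMain")      // placeholder
                .setMaster("yarn-cluster")
                .startApplication(new SparkAppHandle.Listener() {
                    @Override
                    public void stateChanged(SparkAppHandle h) {
                        // e.g. push h.getState() to the web UI (websocket, DB, ...)
                        System.out.println("State changed: " + h.getState());
                    }

                    @Override
                    public void infoChanged(SparkAppHandle h) {
                        System.out.println("App id: " + h.getAppId());
                    }
                });

        // Block until the application reaches a final state (sketch only;
        // a web application would keep the handle and return immediately).
        while (!handle.getState().isFinal()) {
            Thread.sleep(1000);
        }
    }
}

The same handle can also be used to stop the job from the web application via handle.stop() or handle.kill().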
There is also the possibility to submit Spark jobs directly via a RESTful API. Usage is very similar to SparkLauncher, but it is done in a purely RESTful way.
Example request (credits to this article):
curl -X POST http://spark-master-host:6066/v1/submissions/create \
  --header "Content-Type:application/json;charset=UTF-8" \
  --data '{
    "action" : "CreateSubmissionRequest",
    "appArgs" : [ "myAppArgument1" ],
    "appResource" : "hdfs:///filepath/spark-job-1.0.jar",
    "clientSparkVersion" : "1.5.0",
    "environmentVariables" : { "SPARK_ENV_LOADED" : "1" },
    "mainClass" : "spark.ExampleJobInPreparedJar",
    "sparkProperties" : {
      "spark.jars" : "hdfs:///filepath/spark-job-1.0.jar",
      "spark.driver.supervise" : "false",
      "spark.app.name" : "ExampleJobInPreparedJar",
      "spark.eventLog.enabled" : "true",
      "spark.submit.deployMode" : "cluster",
      "spark.master" : "spark://spark-cluster-ip:6066"
    }
  }'
This command will submit the job in the ExampleJobInPreparedJar class to the cluster with the given Spark master. The response will contain a submissionId field, which is helpful for checking the status of the application: simply call another service, curl http://spark-cluster-ip:6066/v1/submissions/status/submissionIdFromResponse. That's it, nothing more to code.
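If the web application needs to poll that status endpoint itself instead of shelling out to curl, a plain HTTP GET is enough. A minimal sketch using only the JDK, with a placeholder host and submission id (the real id comes from the submit response):

// Sketch: poll the status of a job submitted through the submission REST API.
// The host, port and submission id below are placeholders.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class SubmissionStatusCheck {
    public static void main(String[] args) throws Exception {
        String submissionId = "driver-20170101000000-0001";  // from the submit response
        URL url = new URL("http://spark-cluster-ip:6066/v1/submissions/status/" + submissionId);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");

        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line);
            }
        }
        // The JSON response contains a "driverState" field (e.g. RUNNING, FINISHED, FAILED)
        // that the web application can display to the user.
        System.out.println(body);
    }
}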
Livy REST Server and Spark Job Server are RESTful applications which allow you to submit jobs via a RESTful web service. One major difference between those two and Spark's REST interface is that Livy and SJS don't require jobs to be prepared earlier and packaged into a JAR file. You just submit code which will be executed in Spark.
Usage is very simple. The code below is taken from the Livy repository, with some cuts to improve readability.
1) Case 1: submitting a job that is placed on the local machine
// creating client
LivyClient client = new LivyClientBuilder()
    .setURI(new URI(livyUrl))
    .build();

try {
    // sending and submitting JAR file
    client.uploadJar(new File(piJar)).get();
    // PiJob is a class that implements Livy's Job (see the sketch after this section)
    double pi = client.submit(new PiJob(samples)).get();
} finally {
    client.stop(true);
}
2) Case 2: dynamic job creation and execution
# Example in Python. `data` contains Scala code that will be executed in Spark.
data = {
    'code': textwrap.dedent("""\
        val NUM_SAMPLES = 100000;
        val count = sc.parallelize(1 to NUM_SAMPLES).map { i =>
          val x = Math.random();
          val y = Math.random();
          if (x*x + y*y < 1) 1 else 0
        }.reduce(_ + _);
        println(\"Pi is roughly \" + 4.0 * count / NUM_SAMPLES)
        """)
}
r = requests.post(statements_url, data=json.dumps(data), headers=headers)
pprint.pprint(r.json())
As you can see, both pre-compiled jobs and ad-hoc queries to Spark are possible.
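For reference, the PiJob passed to client.submit(...) in case 1 is simply a class implementing Livy's Job interface. A minimal sketch, assuming a current Apache Livy release (package org.apache.livy; older Cloudera builds used com.cloudera.livy):

// Sketch of a job class that can be passed to LivyClient.submit(...).
import org.apache.livy.Job;
import org.apache.livy.JobContext;

import java.util.ArrayList;
import java.util.List;

public class PiJob implements Job<Double> {
    private final int samples;

    public PiJob(int samples) {
        this.samples = samples;
    }

    @Override
    public Double call(JobContext ctx) throws Exception {
        List<Integer> sampleList = new ArrayList<>();
        for (int i = 0; i < samples; i++) {
            sampleList.add(i);
        }
        // Classic Monte Carlo estimation of Pi, executed on the cluster
        long inside = ctx.sc().parallelize(sampleList).filter(i -> {
            double x = Math.random();
            double y = Math.random();
            return x * x + y * y < 1;
        }).count();
        return 4.0 * inside / samples;
    }
}

The job class has to be available on the Livy server side, which is what the client.uploadJar(...) call in the snippet above is for.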
Mist is another Spark-as-a-Service application. It is very simple and similar to Livy and Spark Job Server.
Usage is very similar:
1) Create job file:
import io.hydrosphere.mist.MistJob

object MyCoolMistJob extends MistJob {
    def doStuff(parameters: Map[String, Any]): Map[String, Any] = {
        val rdd = context.parallelize()
        ...
        return result.asInstanceOf[Map[String, Any]]
    }
}
2) Package the job file into a JAR
3) Send a request to Mist:
curl --header "Content-Type: application/json" \
  -X POST http://mist_http_host:mist_http_port/jobs \
  --data '{"path": "/path_to_jar/mist_examples.jar", "className": "SimpleContext$", "parameters": {"digits": [1, 2, 3, 4, 5, 6, 7, 8, 9, 0]}, "namespace": "foo"}'
One strong point that I can see in Mist is that it has out-of-the-box support for streaming jobs via MQTT.
Apache Toree was created to enable easy interactive analytics for Spark. It doesn't require any JAR to be built. It works via the IPython protocol, but Python is not the only supported language.
Currently the documentation focuses on Jupyter notebook support, but there is also a REST-style API.
I've listed a few options: SparkLauncher, the Spark REST API, Livy, Spark Job Server, Mist, and Toree.
All of them are good for different use cases. I can distinguish a few categories:
SparkLauncher is very simple and is a part of the Spark project. You write the job configuration in plain code, so it can be easier to build than JSON objects.
For fully RESTful-style submitting, consider the Spark REST API, Livy, SJS and Mist. Three of them are stable projects with some production use cases. The REST API requires jobs to be pre-packaged, while Livy and SJS don't. However, remember that the Spark REST API comes by default with every Spark distribution, while Livy/SJS do not. I don't know much about Mist, but, after a while, it should be a very good tool for integrating all types of Spark jobs.
Toree focuses on interactive jobs. It's still in incubation, but even now you can check out its possibilities.
Why use a custom, additional REST service when there is a built-in REST API? A Spark-as-a-Service application like Livy is a single entry point to Spark. It manages the Spark context and runs on a single node, which can be located somewhere other than the cluster. These services also enable interactive analytics. Apache Zeppelin uses Livy to submit a user's code to Spark.
Here is an example of the SparkLauncher approach that T.Gawęda mentioned:
SparkAppHandle handle = new SparkLauncher()
    .setSparkHome(SPARK_HOME)
    .setJavaHome(JAVA_HOME)
    .setAppResource(SPARK_JOB_JAR_PATH)
    .setMainClass(SPARK_JOB_MAIN_CLASS)
    .addAppArgs("arg1", "arg2")
    .setMaster("yarn-cluster")
    .setConf("spark.dynamicAllocation.enabled", "true")
    .startApplication();
Here you can find an example of a Java web application with a Spark job bundled together in a single project. Through SparkLauncher you can get a SparkAppHandle which you can use to get information about the job status. If you need progress status you can use the Spark REST API:
http://driverHost:4040/api/v1/applications/[app-id]/jobs
The only dependency you will need for SparkLauncher:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-launcher_2.10</artifactId>
    <version>2.0.1</version>
</dependency>