Best Practice to launch Spark Applications via Web Application?

Tags:

apache-spark

I want to expose my Spark applications to the users with a web application.

Basically, the user decides which action to run and enters a few variables, which need to be passed to the Spark application. For example: the user fills in a few fields and then clicks a button that does the following: "run sparkApp1 with parameters min_x, max_x, min_y, max_y".

The Spark application should be launched with the parameters given by the user. After it finishes, the web application might need to retrieve the results (from HDFS or MongoDB) and display them to the user. While the job is processing, the web application should display the status of the Spark application.

My question:

  • How can the web application launch the Spark application? It could run spark-submit from the command line under the hood, but there might be a better way to do this.
  • How can the web application access the current status of the Spark Application? Is fetching the status from the Spark WebUI's REST API the way to go?

I'm running a cluster of Spark 1.6.1 with YARN/Mesos (not sure yet) and MongoDB.

asked Oct 28 '16 by j9dy



2 Answers

Very basic answer:

Basically, you can use the SparkLauncher class to launch Spark applications and add listeners to watch progress.

However, you may be interested in Livy, a RESTful server for Spark jobs. As far as I know, Zeppelin uses Livy to submit jobs and retrieve their status.

You can also use the Spark REST interface to check state; the information will be more precise that way. An example of how to submit a job via the REST API is shown in the extended answer below.

You've got 3 options; the answer is: check for yourself ;) It depends very much on your project and requirements. The two main options:

  • SparkLauncher + Spark REST interface
  • Livy server

should both be good for you; just check which is easier and better to use in your project.

Extended answer

You can use Spark from your application in different ways, depending on what you need and what you prefer.

SparkLauncher

SparkLauncher is a class from the spark-launcher artifact. It is used to launch already-prepared Spark jobs just like spark-submit does.

Typical usage is:

1) Build the project with your Spark job and copy the JAR file to all nodes.

2) From your client application, i.e. the web application, create a SparkLauncher that points to the prepared JAR file:

SparkAppHandle handle = new SparkLauncher()
    .setSparkHome(SPARK_HOME)
    .setJavaHome(JAVA_HOME)
    .setAppResource(pathToJARFile)
    .setMainClass(MainClassFromJarWithJob)
    .setMaster("MasterAddress")
    .startApplication();
    // or: .launch().waitFor()

startApplication creates a SparkAppHandle, which allows you to add listeners and stop the application. It also provides getAppId.
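A minimal sketch of the listener approach, assuming a hypothetical job JAR and main class (the path, class name and master URL below are placeholders, not values from the question):

import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;

public class LauncherWithListener {
    public static void main(String[] args) throws Exception {
        SparkAppHandle handle = new SparkLauncher()
            .setAppResource("/path/to/spark-job.jar")      // placeholder JAR path
            .setMainClass("com.example.MySparkJob")        // placeholder main class
            .setMaster("yarn-cluster")                     // placeholder master
            .startApplication(new SparkAppHandle.Listener() {
                @Override
                public void stateChanged(SparkAppHandle h) {
                    // fired on every state transition (SUBMITTED, RUNNING, FINISHED, ...)
                    System.out.println("State: " + h.getState());
                }

                @Override
                public void infoChanged(SparkAppHandle h) {
                    // fired when other info changes, e.g. when the application id becomes known
                    System.out.println("App id: " + h.getAppId());
                }
            });
    }
}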

SparkLauncher should be used together with the Spark REST API. You can query http://driverNode:4040/api/v1/applications/[ResultFromGetAppId]/jobs and you will get information about the current status of the application.
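For example, the web application could poll that endpoint with plain HTTP. A minimal sketch, assuming driverNode and appId are supplied by the caller (nothing here is specific to the question's setup):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class SparkStatusClient {
    // Returns the raw JSON describing the application's jobs and their statuses.
    public static String fetchJobs(String driverNode, String appId) throws Exception {
        URL url = new URL("http://" + driverNode + ":4040/api/v1/applications/" + appId + "/jobs");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line);
            }
        }
        conn.disconnect();
        return body.toString();
    }
}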

Spark REST API

There is also the possibility to submit Spark jobs directly via a RESTful API. Usage is very similar to SparkLauncher, but done in a purely RESTful way.

Example request (credits to this article):

curl -X POST http://spark-master-host:6066/v1/submissions/create \
  --header "Content-Type:application/json;charset=UTF-8" \
  --data '{
    "action" : "CreateSubmissionRequest",
    "appArgs" : [ "myAppArgument1" ],
    "appResource" : "hdfs:///filepath/spark-job-1.0.jar",
    "clientSparkVersion" : "1.5.0",
    "environmentVariables" : {
      "SPARK_ENV_LOADED" : "1"
    },
    "mainClass" : "spark.ExampleJobInPreparedJar",
    "sparkProperties" : {
      "spark.jars" : "hdfs:///filepath/spark-job-1.0.jar",
      "spark.driver.supervise" : "false",
      "spark.app.name" : "ExampleJobInPreparedJar",
      "spark.eventLog.enabled": "true",
      "spark.submit.deployMode" : "cluster",
      "spark.master" : "spark://spark-cluster-ip:6066"
    }
  }'

This command submits the job in the ExampleJobInPreparedJar class to the cluster with the given Spark master. The response contains a submissionId field, which is helpful for checking the status of the application: simply call another service, curl http://spark-cluster-ip:6066/v1/submissions/status/submissionIdFromResponse. That's it, nothing more to code.

Livy REST Server and Spark Job Server

Livy REST Server and Spark Job Server are RESTful applications which allow you to submit jobs via a RESTful web service. One major difference between those two and Spark's REST interface is that Livy and SJS don't require jobs to be prepared earlier and packaged into a JAR file. You just submit code which will be executed in Spark.

Usage is very simple. The code is taken from the Livy repository, with some cuts to improve readability.

1) Case 1: submitting a job whose JAR is located on the local machine

// creating client
LivyClient client = new LivyClientBuilder()
  .setURI(new URI(livyUrl))
  .build();

try {
  // sending and submitting JAR file
  client.uploadJar(new File(piJar)).get();
  // PiJob is a class that implements Livy's Job
  double pi = client.submit(new PiJob(samples)).get();
} finally {
  client.stop(true);
}
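For context, PiJob above comes from Livy's examples; a minimal sketch of what such a Job implementation can look like, assuming Livy's Java client API (the package name has varied between com.cloudera.livy and org.apache.livy across versions, so treat it as an assumption):

import java.util.ArrayList;
import java.util.List;

import org.apache.livy.Job;
import org.apache.livy.JobContext;

// Estimates Pi by sampling random points; runs remotely on the Spark cluster.
public class PiJob implements Job<Double> {
    private final int samples;

    public PiJob(int samples) {
        this.samples = samples;
    }

    @Override
    public Double call(JobContext ctx) throws Exception {
        List<Integer> sampleList = new ArrayList<>();
        for (int i = 0; i < samples; i++) {
            sampleList.add(i);
        }
        // Count how many random points fall inside the unit circle.
        long inside = ctx.sc().parallelize(sampleList)
            .filter(i -> {
                double x = Math.random();
                double y = Math.random();
                return x * x + y * y < 1;
            })
            .count();
        return 4.0 * inside / samples;
    }
}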

2) Case 2: dynamic job creation and execution

# example in Python. Data contains code in Scala that will be executed in Spark
import json
import pprint
import textwrap

import requests

# statements_url is assumed to point at an already-created Livy session,
# e.g. http://livy-host:8998/sessions/<session-id>/statements
headers = {'Content-Type': 'application/json'}

data = {
  'code': textwrap.dedent("""\
    val NUM_SAMPLES = 100000;
    val count = sc.parallelize(1 to NUM_SAMPLES).map { i =>
      val x = Math.random();
      val y = Math.random();
      if (x*x + y*y < 1) 1 else 0
    }.reduce(_ + _);
    println(\"Pi is roughly \" + 4.0 * count / NUM_SAMPLES)
    """)
}

r = requests.post(statements_url, data=json.dumps(data), headers=headers)
pprint.pprint(r.json())

As you can see, both pre-compiled jobs and ad-hoc queries to Spark are possible.

Hydrosphere Mist

Another Spark-as-a-Service application. Mist is very simple and similar to Livy and Spark Job Server.

Usage is very similar:

1) Create job file:

import io.hydrosphere.mist.MistJob

object MyCoolMistJob extends MistJob {
    def doStuff(parameters: Map[String, Any]): Map[String, Any] = {
        val rdd = context.parallelize()
        ...
        return result.asInstanceOf[Map[String, Any]]
    }
}

2) Package the job file into a JAR

3) Send a request to Mist:

curl --header "Content-Type: application/json" \
  -X POST http://mist_http_host:mist_http_port/jobs \
  --data '{"path": "/path_to_jar/mist_examples.jar", "className": "SimpleContext$", "parameters": {"digits": [1, 2, 3, 4, 5, 6, 7, 8, 9, 0]}, "namespace": "foo"}'

One strong point that I can see in Mist is its out-of-the-box support for streaming jobs via MQTT.

Apache Toree

Apache Toree was created to enable easy interactive analytics for Spark. It doesn't require any JAR to be built. It works via the IPython protocol, but Python is not the only supported language.

Currently the documentation focuses on Jupyter notebook support, but there is also a REST-style API.

Comparison and conclusions

I've listed a few options:

  1. SparkLauncher
  2. Spark REST API
  3. Livy REST Server and Spark Job Server
  4. Hydrosphere Mist
  5. Apache Toree

All of them are good for different use cases. I can distinguish a few categories:

  1. Tools that require a JAR file with the job: SparkLauncher, Spark REST API
  2. Tools for interactive and pre-packaged jobs: Livy, SJS, Mist
  3. Tools that focus on interactive analytics: Toree (however, there may be some support for pre-packaged jobs; no documentation is published at this moment)

SparkLauncher is very simple and is part of the Spark project. You write the job configuration in plain code, so it can be easier to build than JSON objects.

For fully RESTful-style submission, consider the Spark REST API, Livy, SJS and Mist. Three of them are stable projects with some production use cases. The REST API requires jobs to be pre-packaged, while Livy and SJS don't. However, remember that the Spark REST API is included by default in every Spark distribution and Livy/SJS are not. I don't know much about Mist, but - after a while - it should become a very good tool for integrating all types of Spark jobs.

Toree focuses on interactive jobs. It's still in incubation, but even now you can check out its capabilities.

Why use a custom, additional REST service when there is a built-in REST API? A Spark-as-a-Service tool like Livy is a single entry point to Spark. It manages the Spark context and runs on a single node, which can be located outside the cluster. These tools also enable interactive analytics. Apache Zeppelin uses Livy to submit users' code to Spark.

answered Oct 26 '22 by T. Gawęda


Here is an example of the SparkLauncher that T. Gawęda mentioned:

SparkAppHandle handle = new SparkLauncher()
    .setSparkHome(SPARK_HOME)
    .setJavaHome(JAVA_HOME)
    .setAppResource(SPARK_JOB_JAR_PATH)
    .setMainClass(SPARK_JOB_MAIN_CLASS)
    .addAppArgs("arg1", "arg2")
    .setMaster("yarn-cluster")
    .setConf("spark.dynamicAllocation.enabled", "true")
    .startApplication();

Here you can find an example of a Java web application with a Spark job bundled together in a single project. Through SparkLauncher you can get a SparkAppHandle, which you can use to get info about the job status. If you need progress status you can use the Spark REST API:

http://driverHost:4040/api/v1/applications/[app-id]/jobs 
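If you prefer polling over listeners, the handle itself also exposes the state. A minimal sketch (not taken from the linked project, just an illustration of the SparkAppHandle API):

import org.apache.spark.launcher.SparkAppHandle;

public class HandlePoller {
    // Blocks until the job launched via startApplication() reaches a final state.
    public static void waitForCompletion(SparkAppHandle handle) throws InterruptedException {
        while (!handle.getState().isFinal()) {
            System.out.println("App " + handle.getAppId() + " is in state " + handle.getState());
            Thread.sleep(5000); // check every 5 seconds
        }
        System.out.println("Final state: " + handle.getState());
    }
}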

The only dependency you will need for SparkLauncher:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-launcher_2.10</artifactId>
    <version>2.0.1</version>
</dependency>
answered Oct 26 '22 by MaxNevermind