I have lately been trying out Apache Spark. My question is specifically about triggering Spark jobs. I previously posted a question here on understanding Spark jobs; after getting my hands dirty with jobs, I moved on to my requirement.
I have a REST endpoint where I expose an API to trigger jobs; I used Spring 4.0 for the REST implementation. Going ahead, I thought of implementing Jobs as a Service in Spring, where I would submit a job programmatically: when the endpoint is triggered, I would trigger the job with the given parameters. I now have a few design options.
Similar to the job written below, I need to maintain several jobs invoked through an abstract class, maybe a JobScheduler.
    /* Can this code be abstracted from the application and written as a separate job?
       My understanding is that the application code itself has to have the addJars embedded,
       which the SparkContext internally takes care of. */
    SparkConf sparkConf = new SparkConf()
            .setAppName("MyApp")
            .setJars(new String[] { "/path/to/jar/submit/cluster" })
            .setMaster("/url/of/master/node");
    sparkConf.setSparkHome("/path/to/spark/");
    sparkConf.set("spark.scheduler.mode", "FAIR");
    JavaSparkContext sc = new JavaSparkContext(sparkConf);
    sc.setLocalProperty("spark.scheduler.pool", "test");
    // Application with algorithm, transformations
Extending the above point, have the service handle multiple versions of jobs (a rough sketch follows the options below).
Or else use a Spark Job Server to do this.
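For options 1 and 2, this is roughly what I have in mind, with the SparkContext setup wrapped in a Spring service (all class names, paths and URLs below are placeholders, not a tested implementation):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.springframework.stereotype.Service;

    // Rough sketch only: a singleton Spring bean that owns one JavaSparkContext
    // and runs a job whenever the REST endpoint calls it.
    @Service
    public class SparkJobService {

        private final JavaSparkContext sc;

        public SparkJobService() {
            SparkConf sparkConf = new SparkConf()
                    .setAppName("MyApp")
                    .setJars(new String[] { "/path/to/jar/submit/cluster" })   // placeholder jar path
                    .setMaster("spark://url-of-master-node:7077");             // placeholder master URL
            sparkConf.setSparkHome("/path/to/spark/");
            sparkConf.set("spark.scheduler.mode", "FAIR");
            this.sc = new JavaSparkContext(sparkConf);
        }

        // Each REST call picks a FAIR scheduler pool and runs its own transformations.
        public long runCountJob(String inputPath, String pool) {
            sc.setLocalProperty("spark.scheduler.pool", pool);
            return sc.textFile(inputPath).count(); // placeholder for the real algorithm
        }
    }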
Firstly, I would like to know what the best solution is in this case, both execution-wise and scaling-wise.
Note: I am using a standalone cluster from Spark. Kindly help.
You can submit any Spark application that runs Spark SQL or data transformation, data science, and machine learning jobs by using the Spark jobs REST API. Each submitted job runs in a dedicated cluster. Any configuration settings that you pass through the jobs API override the default configurations.
You can run a Spark REST API job by using Spark standalone mode. One of the basic REST principles is that you get a piece of data (a resource) when you request a URL (an endpoint). You can send the requests with the command-line utility cURL.
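For a standalone master, the submission itself is just an HTTP POST to the master's REST submission endpoint (port 6066 by default). A hedged sketch in plain Java; the JSON fields follow the CreateSubmissionRequest format and may differ slightly between Spark versions, so treat the host, jar path, class name and version as placeholders:

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    public class SubmitViaRest {
        public static void main(String[] args) throws Exception {
            // Assumed: standalone master with the REST submission server on port 6066.
            URL url = new URL("http://master-host:6066/v1/submissions/create");

            // Minimal CreateSubmissionRequest body; adjust clientSparkVersion to your cluster.
            String body = "{"
                    + "\"action\":\"CreateSubmissionRequest\","
                    + "\"appResource\":\"file:/path/to/my-app.jar\","
                    + "\"mainClass\":\"com.example.MyApp\","
                    + "\"clientSparkVersion\":\"1.6.0\","
                    + "\"appArgs\":[\"arg1\"],"
                    + "\"environmentVariables\":{\"SPARK_ENV_LOADED\":\"1\"},"
                    + "\"sparkProperties\":{"
                    + "\"spark.app.name\":\"MyApp\","
                    + "\"spark.master\":\"spark://master-host:6066\","
                    + "\"spark.jars\":\"file:/path/to/my-app.jar\","
                    + "\"spark.submit.deployMode\":\"cluster\""
                    + "}}";

            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("POST");
            conn.setRequestProperty("Content-Type", "application/json;charset=UTF-8");
            conn.setDoOutput(true);
            try (OutputStream os = conn.getOutputStream()) {
                os.write(body.getBytes(StandardCharsets.UTF_8));
            }
            // The response JSON carries the submissionId used for status and kill calls.
            System.out.println("HTTP " + conn.getResponseCode());
        }
    }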
When show is triggered on a Dataset, it is converted into a head(20) action, which in turn becomes a limit(20). Spark executes the limit incrementally until the limit query is satisfied.
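For example, a minimal sketch (the path and the local master are placeholders for illustration):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class ShowExample {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("ShowExample")
                    .master("local[*]") // local run just for illustration
                    .getOrCreate();
            Dataset<Row> ds = spark.read().json("/path/to/data.json"); // placeholder path
            // show() only needs the first 20 rows, so Spark scans partitions incrementally
            // instead of materializing the whole Dataset.
            ds.show();
            spark.stop();
        }
    }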
It turns out Spark has a hidden REST API to submit a job, check its status and kill it.
Check out the full example here: http://arturmkrtchyan.com/apache-spark-hidden-rest-api
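For completeness, a minimal Java sketch of the status and kill calls described in that post (host, port and the submission id are placeholders; the id comes back in the create response):

    import java.net.HttpURLConnection;
    import java.net.URL;

    public class JobControl {
        // Check the state of a previously submitted driver.
        static int status(String submissionId) throws Exception {
            URL url = new URL("http://master-host:6066/v1/submissions/status/" + submissionId);
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("GET");
            return conn.getResponseCode(); // body contains driverState (RUNNING, FINISHED, ...)
        }

        // Ask the master to kill the driver.
        static int kill(String submissionId) throws Exception {
            URL url = new URL("http://master-host:6066/v1/submissions/kill/" + submissionId);
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            conn.getOutputStream().close(); // empty body is enough for the kill call
            return conn.getResponseCode();
        }
    }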