I am trying to automatically launch a Spark job on an HDInsight cluster from Microsoft Azure. I am aware that several methods exist to automate Hadoop job submission (provided by Azure itself), but so far I have not been able to find a way to remotely run a Spark job without setting up an RDP session with the master instance.
Is there any way to achieve this?
You can use the spark-submit command to submit .NET for Apache Spark jobs to Azure HDInsight. Navigate to your HDInsight Spark cluster in the Azure portal, select SSH + Cluster login, copy the SSH login information, and paste it into a terminal.
To use a different password when creating the HDInsight cluster, uncheck Use cluster login password for SSH and then enter the password in the SSH password field. Alternatively, use the --SshCredential parameter of the New-AzHdinsightCluster cmdlet and pass a PSCredential object that contains the SSH user account name and password.
To submit the Spark job: if the job references jars, Py files, or additional files, click the ADVANCED tab and enter the corresponding file paths, then click Submit.
Apache Spark clusters in HDInsight also include the Anaconda libraries, among them scikit-learn for machine learning, along with various data sets you can use to build sample applications directly from a Jupyter Notebook.
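The SSH + spark-submit flow described above can be sketched as follows; note that the cluster name, SSH user, and jar name are placeholders, not values from the question:

```shell
#!/bin/sh
# Sketch of the flow described above. CLUSTER, the SSH user, and the jar
# name are all placeholders -- substitute your own values.
CLUSTER="mycluster"
SSH_TARGET="sshuser@${CLUSTER}-ssh.azurehdinsight.net"
SUBMIT_CMD="spark-submit --class spark.azure.MainClass --master yarn mySpark.jar"

# Run spark-submit on the cluster head node (uncomment to actually run):
#   ssh "$SSH_TARGET" "$SUBMIT_CMD"
echo "$SSH_TARGET"
```

HDInsight exposes the head node at `<clustername>-ssh.azurehdinsight.net`, which is why the target is built that way.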
Spark-jobserver provides a RESTful interface for submitting and managing Apache Spark jobs, jars, and job contexts.
https://github.com/spark-jobserver/spark-jobserver
My solution is to use both a scheduler and spark-jobserver to launch the Spark job periodically.
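As a sketch, a spark-jobserver submission boils down to two HTTP calls against its REST API; the host/port, app name, and class path below are assumptions, not values from an actual deployment:

```shell
#!/bin/sh
# Sketch of a spark-jobserver submission. The host/port, app name, and
# class path are assumptions -- adjust for your deployment.
JOBSERVER="http://localhost:8090"
APP_NAME="myApp"
CLASS_PATH="spark.azure.MainClass"

UPLOAD_URL="${JOBSERVER}/jars/${APP_NAME}"
RUN_URL="${JOBSERVER}/jobs?appName=${APP_NAME}&classPath=${CLASS_PATH}"

# 1. Upload the application jar under the chosen app name:
#      curl -X POST "$UPLOAD_URL" --data-binary @mySpark.jar
# 2. Trigger a run of the job class inside that jar:
#      curl -X POST "$RUN_URL"
echo "$RUN_URL"
```

A scheduler (cron, for instance) can then invoke the second call periodically.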
At the moment of this writing, it seems there is no official way of achieving this. So far, however, I have been able to remotely run Spark jobs, after a fashion, using an Oozie shell workflow. It is nothing but a patch, but so far it has been useful for me. These are the steps I have followed:
<workflow-app name="myWorkflow" xmlns="uri:oozie:workflow:0.2">
    <start to="myAction"/>
    <action name="myAction">
        <shell xmlns="uri:oozie:shell-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <exec>myScript.cmd</exec>
            <file>wasb://[email protected]/myScript.cmd#myScript.cmd</file>
            <file>wasb://[email protected]/mySpark.jar#mySpark.jar</file>
        </shell>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Shell action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
Note that it is not possible to know in advance which HDInsight node will execute the script, so it is necessary to put it, along with the Spark application .jar, in the wasb repository. It is then copied to the local directory in which the Oozie job is executing.
C:\apps\dist\spark-1.2.0\bin\spark-submit --class spark.azure.MainClass ^
    --master yarn-cluster ^
    --deploy-mode cluster ^
    --num-executors 3 ^
    --executor-memory 2g ^
    --executor-cores 4 ^
    mySpark.jar
It is necessary to upload both the .cmd and the Spark .jar to the wasb repository (a process not covered in this answer), specifically to the location referenced in the workflow:
wasb://[email protected]/
The PowerShell script is taken almost verbatim from the official Oozie on HDInsight tutorial, so I am not reproducing it in this answer.
I have made a new suggestion on the Azure feedback portal requesting official support for remote Spark job submission.
Updated on 8/17/2016: Our Spark cluster offering now includes a Livy server that provides a REST service for submitting Spark jobs. You can automate Spark jobs via Azure Data Factory as well.
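A minimal sketch of a Livy batch submission from outside the cluster, assuming the standard HDInsight endpoint; the cluster name, credentials, and jar location are placeholders:

```shell
#!/bin/sh
# Sketch of submitting a Spark batch through Livy on HDInsight. The cluster
# name, credentials, and jar location are placeholders.
CLUSTER="mycluster"
LIVY_URL="https://${CLUSTER}.azurehdinsight.net/livy/batches"

# JSON body naming the jar in the cluster's storage and its main class.
PAYLOAD='{"file": "wasb:///mySpark.jar", "className": "spark.azure.MainClass"}'

# Uncomment to actually submit (admin:password is the cluster login):
#   curl -u admin:password -H "Content-Type: application/json" \
#        -X POST -d "$PAYLOAD" "$LIVY_URL"
echo "$LIVY_URL"
```

Livy responds with a batch id, which you can poll via GET on the same `/livy/batches` endpoint to track job state.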
Original post: 1) Remote job submission for Spark is currently not supported.
2) If you want to avoid setting a master every time (i.e. adding --master yarn-client on every execution), you can set the value in the %SPARK_HOME%\conf\spark-defaults.conf file with the following config:
spark.master yarn-client
You can find more info on spark-defaults.conf on the Apache Spark website.
3) Use the cluster customization feature if you want this added to the spark-defaults.conf file automatically at deployment time.
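As an illustration, a customization script could append the default master to spark-defaults.conf; this is a Linux-style sketch, and the conf path shown in the comment is an assumption, not taken from the answer:

```shell
#!/bin/sh
# Appends a default master to a spark-defaults.conf file. The real path on
# the cluster (shown in the comment below) is an assumption -- verify it
# on your nodes before using this in a customization step.
append_default_master() {
  # $1: path to spark-defaults.conf
  echo "spark.master yarn-client" >> "$1"
}

# On a cluster node this would typically be invoked as:
#   append_default_master "$SPARK_HOME/conf/spark-defaults.conf"
```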