
Remotely execute a Spark job on an HDInsight cluster

I am trying to automatically launch a Spark job on an HDInsight cluster from Microsoft Azure. I am aware that several methods exist to automate Hadoop job submission (provided by Azure itself), but so far I have not been able to find a way to remotely run a Spark job without setting up an RDP connection to the master instance.

Is there any way to achieve this?

Asked Feb 16 '15 by Mikel Urkia



3 Answers

Spark-jobserver provides a RESTful interface for submitting and managing Apache Spark jobs, jars, and job contexts.

https://github.com/spark-jobserver/spark-jobserver

My solution uses both a scheduler and Spark-jobserver to launch the Spark job periodically.
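For reference, a rough sketch of what a submission through the spark-jobserver REST API looks like, assuming the server runs on its default port 8090 and using placeholder jar and class names (note that a job class must implement spark-jobserver's SparkJob interface, so a plain spark-submit application needs a small adaptation):

# Upload the application jar under an app name
curl --data-binary @mySpark.jar http://localhost:8090/jars/mySpark

# Launch the job; the response contains a job id that can be polled under /jobs/<id>
curl -X POST "http://localhost:8090/jobs?appName=mySpark&classPath=spark.azure.MainClass"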

Answered Oct 17 '22 by Hai Cu

At the time of writing, it seems there is no official way of achieving this. So far, however, I have been able to remotely run Spark jobs using an Oozie shell workflow. It is nothing more than a workaround, but so far it has been useful for me. These are the steps I have followed:

Prerequisites

  • Microsoft Powershell
  • Azure Powershell

Process

Define an Oozie workflow .xml file:

<workflow-app name="myWorkflow" xmlns="uri:oozie:workflow:0.2">
  <start to = "myAction"/>
  <action name="myAction">
        <shell xmlns="uri:oozie:shell-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <exec>myScript.cmd</exec>
            <file>wasb://mycontainer@mystorageaccount.blob.core.windows.net/myScript.cmd#myScript.cmd</file>
            <file>wasb://mycontainer@mystorageaccount.blob.core.windows.net/mySpark.jar#mySpark.jar</file>
        </shell>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Shell action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>   

Note that it is not possible to know in advance on which HDInsight node the script is going to be executed, so it is necessary to place it, along with the Spark application .jar, in the wasb repository. It is then copied to the local directory in which the Oozie job is executing.

Define the custom script

REM myScript.cmd - invokes the node-local spark-submit; the class name and
REM executor settings below are specific to the author's application
C:\apps\dist\spark-1.2.0\bin\spark-submit --class spark.azure.MainClass ^
                                          --master yarn-cluster ^
                                          --deploy-mode cluster ^
                                          --num-executors 3 ^
                                          --executor-memory 2g ^
                                          --executor-cores 4 ^
                                          mySpark.jar

It is necessary to upload both the .cmd and the Spark .jar to the wasb repository (a process that is not included in this answer), specifically to the location referenced in the workflow:

wasb://mycontainer@mystorageaccount.blob.core.windows.net/

Define the PowerShell script

The PowerShell script is largely taken from the official Oozie on HDInsight tutorial. I am not including the script in this answer because it is almost identical to the one in that tutorial.
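For orientation, a rough sketch of the call such a script ends up making, written here with curl against Oozie's standard REST API; the endpoint URL, credentials, and property values are placeholders and depend on the cluster (the PowerShell script performs the equivalent request):

# Configuration sent with the submission; all values are placeholders
cat > job-config.xml <<'EOF'
<configuration>
  <property><name>user.name</name><value>admin</value></property>
  <property><name>nameNode</name><value>wasb://mycontainer@mystorageaccount.blob.core.windows.net</value></property>
  <property><name>jobTracker</name><value>jobtrackerhost:9010</value></property>
  <property><name>queueName</name><value>default</value></property>
  <property><name>oozie.wf.application.path</name><value>wasb://mycontainer@mystorageaccount.blob.core.windows.net/</value></property>
</configuration>
EOF

# Start the workflow through the Oozie REST API; the response contains the job id
curl -k -u admin:MyPassword -H "Content-Type: application/xml" \
     --data-binary @job-config.xml \
     "https://mycluster.azurehdinsight.net/oozie/v2/jobs?action=start"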

I have made a new suggestion on the Azure feedback portal indicating the need for official support for remote Spark job submission.

Answered Oct 17 '22 by Mikel Urkia


Updated on 8/17/2016: Our Spark cluster offering now includes a Livy server that provides a REST service for submitting Spark jobs. You can automate Spark jobs via Azure Data Factory as well.
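As an illustration, a minimal sketch of a remote batch submission through Livy on an HDInsight Spark cluster, assuming the cluster login credentials and a jar already uploaded to the cluster's default storage (all names and paths are placeholders):

# Submit a batch job; the response contains an id
curl -k -u admin:MyPassword -H "Content-Type: application/json" \
     -d '{ "file": "wasb:///example/jars/mySpark.jar", "className": "spark.azure.MainClass" }' \
     "https://mycluster.azurehdinsight.net/livy/batches"

# Poll its state (id 0 here as an example)
curl -k -u admin:MyPassword "https://mycluster.azurehdinsight.net/livy/batches/0"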


Original post: 1) Remote job submission for Spark is currently not supported.

2) If you want to automate setting the master (i.e. avoid adding --master yarn-client every time you execute), you can set the value in the %SPARK_HOME%\conf\spark-defaults.conf file with the following config:

spark.master yarn-client

You can find more info on spark-defaults.conf on the Apache Spark website.

3) Use the cluster customization feature if you want to add this automatically to the spark-defaults.conf file at deployment time.
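For illustration, a minimal sketch of what such a customization script could append on a Linux-based cluster; the configuration path is an assumption, and on the Windows-based clusters of that era the equivalent would be a customization script editing %SPARK_HOME%\conf\spark-defaults.conf:

# Bake the default master into spark-defaults.conf at deployment time
echo "spark.master yarn-client" >> /etc/spark/conf/spark-defaults.conf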

Answered Oct 17 '22 by CatNinja