
Launching a Spark program using an Oozie workflow

I am working with a Scala program that uses Spark packages. Currently I run the program with the following bash command from the gateway:

    /homes/spark/bin/spark-submit --master yarn-cluster --class "com.xxx.yyy.zzz" --driver-java-options "-Dyyy.num=5" a.jar arg1 arg2

I would like to start using Oozie to run this job. I have a few questions:

Where should I put the spark-submit executable? On HDFS? How do I define the Spark action? Where should the --driver-java-options appear? What should the Oozie action look like? Is it similar to the one appearing here?

asked Mar 24 '15 by Shaharg

People also ask

How do you run a Spark job in Oozie?

To use the Oozie Spark action with Spark 2 jobs, create a spark2 ShareLib directory, copy the associated files into it, and then point Oozie to spark2. (The Oozie ShareLib is a set of libraries that allow jobs to run on any node in a cluster.) To verify the configuration, run the Oozie shareliblist command.
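The steps above can be sketched as shell commands. This is only an illustration: the ShareLib timestamp directory, the Spark jar location, and the user paths are assumptions that depend on your cluster layout.

```shell
# Sketch, assuming a default ShareLib layout -- adjust paths and the
# lib_<timestamp> directory name to match your cluster.

# Create a spark2 directory inside the Oozie ShareLib on HDFS:
hdfs dfs -mkdir /user/oozie/share/lib/lib_20180701000000/spark2

# Copy the Spark 2 jars into it (source path is an assumption):
hdfs dfs -put /opt/spark2/jars/* /user/oozie/share/lib/lib_20180701000000/spark2/

# Tell Oozie to pick up the new directory, then verify it is listed:
oozie admin -sharelibupdate
oozie admin -shareliblist spark2
```

Jobs can then select the new directory by setting `oozie.action.sharelib.for.spark=spark2` in their configuration.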

Does Oozie support Spark?

Oozie is a workflow engine that executes sequences of actions structured as directed acyclic graphs (DAGs). Each action is an individual unit of work, such as a Spark job or Hive query. The Oozie "Spark action" runs a Spark job as part of an Oozie workflow.

How do I control workflow with Oozie?

There can be decision nodes to decide how and under which condition a job should run. A fork is used to run multiple jobs in parallel. Oozie workflows can be parameterized (variables like ${nameNode} can be referenced within the workflow definition). These parameters come from a configuration file called a properties file.
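Those constructs can be combined in a single workflow definition. A minimal parameterized sketch with a fork/join is shown below; the action names and the decision condition are hypothetical, and the two forked actions are elided.

```xml
<workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="check-input"/>

    <!-- decision node: route based on a parameter from the properties file -->
    <decision name="check-input">
        <switch>
            <case to="fork-jobs">${runBoth eq "true"}</case>
            <default to="spark-job"/>
        </switch>
    </decision>

    <!-- fork: run two actions in parallel, then join -->
    <fork name="fork-jobs">
        <path start="spark-job"/>
        <path start="hive-job"/>
    </fork>
    <join name="join-jobs" to="end"/>

    <!-- spark-job and hive-job action definitions omitted -->

    <kill name="fail">
        <message>Workflow failed</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

The values of ${runBoth}, ${nameNode}, etc. are supplied at submission time from the job's properties file.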


1 Answer

If you have a new enough version of Oozie, you can use Oozie's Spark action:

https://github.com/apache/oozie/blob/master/client/src/main/resources/spark-action-0.1.xsd
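Based on that schema, a Spark action could look roughly like the sketch below. This also shows where the --driver-java-options from the original command would go: inside <spark-opts>. The jar path is an assumption; adjust it to wherever you upload your application jar on HDFS.

```xml
<action name="spark-job">
    <spark xmlns="uri:oozie:spark-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <master>yarn-cluster</master>
        <name>zzz-job</name>
        <class>com.xxx.yyy.zzz</class>
        <!-- jar path is hypothetical; point it at your uploaded jar -->
        <jar>${nameNode}/user/spark/apps/a.jar</jar>
        <!-- extra spark-submit flags, including driver JVM options -->
        <spark-opts>--driver-java-options -Dyyy.num=5</spark-opts>
        <arg>arg1</arg>
        <arg>arg2</arg>
    </spark>
    <ok to="end"/>
    <error to="fail"/>
</action>
```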

Otherwise you need to execute a Java action that calls SparkSubmit directly. Something like:

    <java>
        <main-class>org.apache.spark.deploy.SparkSubmit</main-class>

        <arg>--class</arg>
        <arg>${spark_main_class}</arg> <!-- your main class, i.e. com.xxx.yyy.zzz -->

        <arg>--deploy-mode</arg>
        <arg>cluster</arg>

        <arg>--master</arg>
        <arg>yarn</arg>

        <arg>--queue</arg>
        <arg>${queue_name}</arg> <!-- depends on your Oozie config -->

        <arg>--num-executors</arg>
        <arg>${spark_num_executors}</arg>

        <arg>--executor-cores</arg>
        <arg>${spark_executor_cores}</arg>

        <arg>${spark_app_file}</arg> <!-- jar that contains your Spark job, written in Scala -->

        <arg>${input}</arg>  <!-- some application arg -->
        <arg>${output}</arg> <!-- some other application arg -->

        <file>${spark_app_file}</file>

        <file>${name_node}/user/spark/share/lib/spark-assembly.jar</file>
    </java>
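The ${...} variables in the action above are resolved from the job's properties file at submission time. A sketch of such a file follows; every value here is an example and must be adjusted to your cluster.

```properties
# job.properties -- example values only, adjust for your cluster
name_node=hdfs://namenode:8020
spark_main_class=com.xxx.yyy.zzz
queue_name=default
spark_num_executors=5
spark_executor_cores=2
# HDFS path of your application jar (hypothetical location):
spark_app_file=${name_node}/user/spark/apps/a.jar
input=arg1
output=arg2
# HDFS directory containing workflow.xml (hypothetical location):
oozie.wf.application.path=${name_node}/user/spark/workflows/spark-wf
```

You would then submit with something like `oozie job -config job.properties -run`.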
answered Sep 21 '22 by nurieta