Running scheduled Spark job

Tags:

apache-spark

I have a Spark job which reads a source table, does a number of map / flatten / reduce operations and then stores the results into a separate table we use for reporting. Currently this job is run manually using the spark-submit script. I want to schedule it to run every night so the results are pre-populated for the start of the day. Do I:

  1. Set up a cron job to call the spark-submit script?
  2. Add scheduling into my job class, so that it is submitted once but performs the actions every night?
  3. Is there a built-in mechanism in Spark or a separate script that will help me do this?

We are running Spark in Standalone mode.

Any suggestions appreciated!

asked May 21 '15 by Matt

People also ask

How are Spark jobs scheduled?

By default, Spark's scheduler runs jobs in FIFO fashion. Each job is divided into “stages” (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc.
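If FIFO ordering is not what you want when several jobs share one SparkContext, Spark also has a FAIR scheduler mode that can be enabled through the spark.scheduler.mode property. A minimal sketch at submit time (the class name, master URL, and jar name here are hypothetical):

spark-submit \
  --class com.example.ReportingJob \
  --master spark://spark-master:7077 \
  --conf spark.scheduler.mode=FAIR \
  spark-jobs-assembly.jar

For a single nightly batch job the default FIFO behaviour is usually fine; FAIR mode matters mainly when one long-running application runs many concurrent jobs.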

How do you automate a Spark job?

To use Oozie Spark action with Spark 2 jobs, create a spark2 ShareLib directory, copy associated files into it, and then point Oozie to spark2 . (The Oozie ShareLib is a set of libraries that allow jobs to run on any node in a cluster.) To verify the configuration, run the Oozie shareliblist command.
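A rough sketch of those steps from the command line; the ShareLib path and Oozie server URL below are assumptions that vary by installation, while oozie admin -sharelibupdate and -shareliblist are the standard CLI commands:

# Create a spark2 ShareLib directory and copy the Spark 2 jars into it
# (the lib_<timestamp> path is an assumption; use your cluster's current ShareLib directory)
hdfs dfs -mkdir -p /user/oozie/share/lib/lib_20220922/spark2
hdfs dfs -put $SPARK_HOME/jars/* /user/oozie/share/lib/lib_20220922/spark2/

# Refresh Oozie's view of the ShareLib, then verify spark2 is listed
oozie admin -oozie http://oozie-host:11000/oozie -sharelibupdate
oozie admin -oozie http://oozie-host:11000/oozie -shareliblist spark2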

How do I know if I am running Spark jobs?

Click Analytics > Spark Analytics > Open the Spark Application Monitoring Page. Click Monitor > Workloads, and then click the Spark tab. This page displays the user names of the clusters that you are authorized to monitor and the number of applications that are currently running in each cluster.
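That navigation is specific to one vendor's console. On a plain standalone cluster like the one in the question, Spark's own endpoints give the same information; a sketch (host names are placeholders, ports are Spark's defaults):

# Standalone master web UI (port 8080) lists running and completed applications; /json returns the same as JSON
curl http://spark-master:8080/json

# A running driver exposes the REST monitoring API on port 4040
curl http://driver-host:4040/api/v1/applications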

How do I schedule a Spark job in Airflow?

The steps involved in scheduling Spark jobs with Airflow are as follows:

  1. Business logic.
  2. Diving into Airflow.
  3. Building the DAG.


1 Answer

You can use a crontab, but as you start having Spark jobs that depend on other Spark jobs, I would recommend Pinball for coordination: https://github.com/pinterest/pinball

To get a simple crontab working, I would create a wrapper script such as:

#!/bin/bash
cd /locm/spark_jobs

export SPARK_HOME=/usr/hdp/2.2.0.0-2041/spark
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HADOOP_USER_NAME=hdfs
export HADOOP_GROUP=hdfs

#export SPARK_CLASSPATH=$SPARK_CLASSPATH:/locm/spark_jobs/configs/*

# Positional arguments: main class, Spark master URL, extra spark-submit options, application arguments
CLASS=$1
MASTER=$2
ARGS=$3
CLASS_ARGS=$4

echo "Running $CLASS With Master: $MASTER With Args: $ARGS And Class Args: $CLASS_ARGS"

# Submit the assembly jar and append driver output to a per-class log file
$SPARK_HOME/bin/spark-submit --class $CLASS --master $MASTER --num-executors 4 --executor-cores 4 $ARGS spark-jobs-assembly*.jar $CLASS_ARGS >> /locm/spark_jobs/logs/$CLASS.log 2>&1
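The script can also be run by hand to check the arguments before wiring it into cron; the script name, class name, master URL, and argument values below are hypothetical, purely to illustrate the four positional parameters:

/locm/spark_jobs/run_job.sh com.example.ReportingJob spark://spark-master:7077 "--driver-memory 2g" "2015-05-21"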

Then create a crontab entry by:

  1. Running crontab -e
  2. Inserting 30 1 * * * /PATH/TO/SCRIPT.sh $CLASS "yarn-client"
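Since the question runs Spark in standalone mode, the master argument would be the standalone master URL rather than yarn-client. A sketch of such an entry, firing at 01:30 every night (script path, class name, and host are hypothetical):

30 1 * * * /locm/spark_jobs/run_job.sh com.example.ReportingJob spark://spark-master:7077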
answered Sep 22 '22 by ben jarman