I have a Spark job which reads a source table, does a number of map / flatten / reduce operations and then stores the results into a separate table we use for reporting. Currently this job is run manually using the spark-submit
script. I want to schedule it to run every night so the results are pre-populated for the start of the day. Do I just set up a cron job to call the spark-submit script? We are running Spark in Standalone mode.
Any suggestions appreciated!
By default, Spark's scheduler runs jobs in FIFO fashion. Each job is divided into “stages” (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc.
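That scheduler mode only matters for jobs submitted inside the same Spark application; a minimal sketch of switching it from the default FIFO to fair sharing via spark-submit (the class name, master URL, and jar below are placeholders, not values from the question):

spark-submit \
  --class com.example.reporting.NightlyReportJob \
  --master spark://spark-master:7077 \
  --conf spark.scheduler.mode=FAIR \
  spark-jobs-assembly.jar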
To use the Oozie Spark action with Spark 2 jobs, create a spark2 ShareLib directory, copy the associated files into it, and then point Oozie to spark2. (The Oozie ShareLib is a set of libraries that allow jobs to run on any node in a cluster.) To verify the configuration, run the Oozie shareliblist command.
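A sketch of that setup from the command line, assuming the ShareLib root is /user/oozie/share/lib, lib_<timestamp> stands for the current ShareLib directory on your cluster, and OOZIE_URL already points at your Oozie server:

# Create the spark2 ShareLib directory and copy the Spark 2 jars into it
hdfs dfs -mkdir /user/oozie/share/lib/lib_<timestamp>/spark2
hdfs dfs -put $SPARK_HOME/jars/* /user/oozie/share/lib/lib_<timestamp>/spark2/

# Refresh the ShareLib and verify that spark2 is now listed
oozie admin -sharelibupdate
oozie admin -shareliblist spark2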
Click Analytics > Spark Analytics to open the Spark application monitoring page, or click Monitor > Workloads and then the Spark tab. This page displays the names of the clusters that you are authorized to monitor and the number of applications currently running in each cluster.
The steps involved in scheduling Spark jobs with Airflow are: defining the business logic of the job, diving into Airflow, and building the DAG.
You can use a crontab, but as you start having Spark jobs that depend on other Spark jobs I would recommend Pinball for coordination: https://github.com/pinterest/pinball
To get a simple crontab working, I would create a wrapper script such as:
#!/bin/bash
cd /locm/spark_jobs

export SPARK_HOME=/usr/hdp/2.2.0.0-2041/spark
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HADOOP_USER_NAME=hdfs
export HADOOP_GROUP=hdfs
#export SPARK_CLASSPATH=$SPARK_CLASSPATH:/locm/spark_jobs/configs/*

# Job class, Spark master URL, spark-submit args, and application args passed in by the caller
CLASS=$1
MASTER=$2
ARGS=$3
CLASS_ARGS=$4

echo "Running $CLASS With Master: $MASTER With Args: $ARGS And Class Args: $CLASS_ARGS"

# Append the driver output to a per-class log file
$SPARK_HOME/bin/spark-submit --class $CLASS --master $MASTER --num-executors 4 --executor-cores 4 $ARGS spark-jobs-assembly*.jar $CLASS_ARGS >> /locm/spark_jobs/logs/$CLASS.log 2>&1
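To test the wrapper before scheduling it, run it once by hand; a sketch assuming it is saved as /locm/spark_jobs/run_spark_job.sh (a hypothetical name) and that the job class and Standalone master URL are placeholders for your own:

chmod +x /locm/spark_jobs/run_spark_job.sh
/locm/spark_jobs/run_spark_job.sh com.example.reporting.NightlyReportJob spark://spark-master:7077
# The driver output lands in the per-class log file written by the wrapper
tail -f /locm/spark_jobs/logs/com.example.reporting.NightlyReportJob.log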
Then create a crontab entry that calls the wrapper every night, as shown below.
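A minimal sketch, assuming the same hypothetical script path as above and a 00:30 start time: open the crontab with crontab -e and add a line such as

# m h dom mon dow  command
30 0 * * * /locm/spark_jobs/run_spark_job.sh com.example.reporting.NightlyReportJob spark://spark-master:7077

The class name and master URL are placeholders; cron itself needs no extra redirection because the wrapper already appends spark-submit's output to its log file.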