Running scheduled Spark job

Tags:

apache-spark

I have a Spark job which reads a source table, does a number of map / flatten / reduce operations and then stores the results into a separate table we use for reporting. Currently this job is run manually using the spark-submit script. I want to schedule it to run every night so the results are pre-populated for the start of the day. Do I:

  1. Set up a cron job to call the spark-submit script?
  2. Add scheduling into my job class, so that it is submitted once but performs the actions every night?
  3. Is there a built-in mechanism in Spark or a separate script that will help me do this?

We are running Spark in Standalone mode.

Any suggestions appreciated!

asked May 21 '15 by Matt

People also ask

How are Spark jobs scheduled?

By default, Spark's scheduler runs jobs in FIFO fashion. Each job is divided into “stages” (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc.
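If FIFO ordering is not what you want when several jobs share one SparkContext, Spark also has a FAIR scheduler mode that can be enabled through the spark.scheduler.mode property. A minimal sketch at submit time (the class name, master URL, and jar name here are hypothetical):

spark-submit \
  --class com.example.ReportingJob \
  --master spark://spark-master:7077 \
  --conf spark.scheduler.mode=FAIR \
  spark-jobs-assembly.jar

For a single nightly batch job the default FIFO behaviour is usually fine; FAIR mode matters mainly when one long-running application runs many concurrent jobs.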

How do you automate a Spark job?

To use Oozie Spark action with Spark 2 jobs, create a spark2 ShareLib directory, copy associated files into it, and then point Oozie to spark2 . (The Oozie ShareLib is a set of libraries that allow jobs to run on any node in a cluster.) To verify the configuration, run the Oozie shareliblist command.
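A rough sketch of those steps from the command line; the ShareLib path and Oozie server URL below are assumptions that vary by installation, while oozie admin -sharelibupdate and -shareliblist are the standard CLI commands:

# Create a spark2 ShareLib directory and copy the Spark 2 jars into it
# (the lib_<timestamp> path is an assumption; use your cluster's current ShareLib directory)
hdfs dfs -mkdir -p /user/oozie/share/lib/lib_20220922/spark2
hdfs dfs -put $SPARK_HOME/jars/* /user/oozie/share/lib/lib_20220922/spark2/

# Refresh Oozie's view of the ShareLib, then verify spark2 is listed
oozie admin -oozie http://oozie-host:11000/oozie -sharelibupdate
oozie admin -oozie http://oozie-host:11000/oozie -shareliblist spark2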

How do I know if I am running Spark jobs?

Click Analytics > Spark Analytics > Open the Spark Application Monitoring Page. Click Monitor > Workloads, and then click the Spark tab. This page displays the user names of the clusters that you are authorized to monitor and the number of applications that are currently running in each cluster.
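That navigation is specific to one vendor's console. On a plain standalone cluster like the one in the question, Spark's own endpoints give the same information; a sketch (host names are placeholders, ports are Spark's defaults):

# Standalone master web UI (port 8080) lists running and completed applications; /json returns the same as JSON
curl http://spark-master:8080/json

# A running driver exposes the REST monitoring API on port 4040
curl http://driver-host:4040/api/v1/applications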

How do I schedule a Spark job in Airflow?

The steps involved in scheduling Spark jobs with Airflow are as follows:

  1. Business logic.
  2. Diving into Airflow.
  3. Building the DAG.


1 Answer

You can use a crontab, but as you start having Spark jobs that depend on other Spark jobs, I would recommend Pinball for coordination: https://github.com/pinterest/pinball

To get a simple crontab working, I would create a wrapper script such as:

#!/bin/bash
cd /locm/spark_jobs

export SPARK_HOME=/usr/hdp/2.2.0.0-2041/spark
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HADOOP_USER_NAME=hdfs
export HADOOP_GROUP=hdfs

#export SPARK_CLASSPATH=$SPARK_CLASSPATH:/locm/spark_jobs/configs/*

# Positional arguments: main class, Spark master URL, extra spark-submit options, application arguments
CLASS=$1
MASTER=$2
ARGS=$3
CLASS_ARGS=$4

echo "Running $CLASS With Master: $MASTER With Args: $ARGS And Class Args: $CLASS_ARGS"

# Submit the assembly jar and append driver output to a per-class log file
$SPARK_HOME/bin/spark-submit --class $CLASS --master $MASTER --num-executors 4 --executor-cores 4 $ARGS spark-jobs-assembly*.jar $CLASS_ARGS >> /locm/spark_jobs/logs/$CLASS.log 2>&1
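The script can also be run by hand to check the arguments before wiring it into cron; the script name, class name, master URL, and argument values below are hypothetical, purely to illustrate the four positional parameters:

/locm/spark_jobs/run_job.sh com.example.ReportingJob spark://spark-master:7077 "--driver-memory 2g" "2015-05-21"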

Then create a crontab entry by:

  1. Running crontab -e
  2. Inserting 30 1 * * * /PATH/TO/SCRIPT.sh $CLASS "yarn-client"
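Since the question runs Spark in standalone mode, the master argument would be the standalone master URL rather than yarn-client. A sketch of such an entry, firing at 01:30 every night (script path, class name, and host are hypothetical):

30 1 * * * /locm/spark_jobs/run_job.sh com.example.ReportingJob spark://spark-master:7077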
answered Sep 22 '22 by ben jarman