Is there a way to submit a Spark job to a different server running the master?

We have a requirement to schedule Spark jobs. Since we are familiar with Apache Airflow, we want to use it to create the different workflows. I searched the web but did not find a step-by-step guide to scheduling a Spark job on Airflow, or an option to run the jobs on a different server that runs the master.

An answer to this would be highly appreciated. Thanks in advance.

asked Nov 16 '18 19:11

Raghav salotra

1 Answer

There are three ways to submit Spark jobs remotely using Apache Airflow:

(1) Using SparkSubmitOperator: This operator expects a spark-submit binary and a YARN client configuration to be set up on the Airflow server. It invokes the spark-submit command with the given options, blocks until the job finishes, and returns the final status. It also streams the logs from the spark-submit command's stdout and stderr.

I believe you only need to configure a yarn-site.xml file for spark-submit --master yarn --deploy-mode client to work.

Once an Application Master is deployed within YARN, Spark runs locally to the Hadoop cluster.

If you really want, you could also submit an hdfs-site.xml and a hive-site.xml from Airflow (if that's possible); otherwise, at least the hdfs-site.xml file should be picked up from the YARN container classpath.
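As a rough sketch of option (1), the DAG below wraps a single SparkSubmitOperator task. It assumes Airflow 2.4 or later with the apache-airflow-providers-apache-spark package installed, plus an Airflow connection named spark_yarn whose host is yarn and whose extras set deploy-mode to client. The connection name, DAG id, application path, and Spark settings are all placeholders, not values from the question. On Airflow 1.10 the operator lived under airflow.contrib.operators.spark_submit_operator instead.

```python
from datetime import datetime

# Placeholder values for illustration only.
SPARK_CONN_ID = "spark_yarn"          # hypothetical connection: host "yarn", extra {"deploy-mode": "client"}
APPLICATION = "/opt/jobs/etl_job.py"  # hypothetical job path on the Airflow server


def spark_submit_kwargs(run_date: str) -> dict:
    """Keyword arguments handed to SparkSubmitOperator for one run."""
    return {
        "task_id": "submit_etl_job",
        "conn_id": SPARK_CONN_ID,
        "application": APPLICATION,
        "name": "etl_job",
        "conf": {"spark.executor.memory": "2g"},
        "application_args": ["--date", run_date],
        "verbose": True,
    }


def build_dag():
    """Create the DAG. Airflow is imported here so the helper above works without it."""
    from airflow import DAG
    from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

    with DAG(
        dag_id="spark_submit_example",
        start_date=datetime(2018, 11, 16),
        schedule="@daily",  # on Airflow older than 2.4, use schedule_interval instead
        catchup=False,
    ) as dag:
        # application_args is templated, so "{{ ds }}" becomes the run's logical date.
        SparkSubmitOperator(**spark_submit_kwargs("{{ ds }}"))
    return dag
```

In a real DAG file you would assign dag = build_dag() at module level so the scheduler can discover it.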

(2) Using SSHOperator: Use this operator to run bash commands such as spark-submit on a remote server (over SSH, via the paramiko library). The benefit of this approach is that you don't need to copy hdfs-site.xml or maintain any files.
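A minimal sketch of option (2), assuming Airflow 2.4 or later with the apache-airflow-providers-ssh package and an SSH connection named spark_edge_node that points at a host where spark-submit already works. The connection name, job path, and arguments are hypothetical. The helper only builds a shell-safe command string; the operator runs it on the remote host.

```python
import shlex
from datetime import datetime


def spark_submit_command(application: str, app_args: list,
                         master: str = "yarn", deploy_mode: str = "client") -> str:
    """Build a shell-quoted spark-submit command line for the remote host."""
    argv = ["spark-submit", "--master", master, "--deploy-mode", deploy_mode,
            application, *app_args]
    return " ".join(shlex.quote(part) for part in argv)


def build_dag():
    """Create the DAG. Airflow is imported here so the helper above works without it."""
    from airflow import DAG
    from airflow.providers.ssh.operators.ssh import SSHOperator

    with DAG(
        dag_id="spark_over_ssh_example",
        start_date=datetime(2018, 11, 16),
        schedule="@daily",  # on Airflow older than 2.4, use schedule_interval instead
        catchup=False,
    ) as dag:
        SSHOperator(
            task_id="spark_submit_over_ssh",
            ssh_conn_id="spark_edge_node",  # hypothetical connection name
            command=spark_submit_command("/opt/jobs/etl_job.py", ["--env", "prod"]),
        )
    return dag
```

Quoting each argument with shlex.quote keeps paths or values containing spaces from being split by the remote shell.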

(3) Using SimpleHttpOperator with Livy: Livy is an open-source REST interface for interacting with Apache Spark from anywhere. You only need to make REST calls.
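A hedged sketch of option (3): it posts a batch to Livy's /batches endpoint through SimpleHttpOperator. It assumes Airflow 2.4 or later with an apache-airflow-providers-http release that still ships SimpleHttpOperator, and an HTTP connection named livy_http pointing at the Livy server. The connection name, job location, and arguments are hypothetical.

```python
import json
from datetime import datetime


def livy_batch_payload(file: str, args: list, name: str) -> str:
    """JSON body for Livy's POST /batches request."""
    return json.dumps({"file": file, "args": args, "name": name})


def batch_accepted(response) -> bool:
    """True when Livy's reply carries an id for the newly created batch."""
    return "id" in response.json()


def build_dag():
    """Create the DAG. Airflow is imported here so the helpers above work without it."""
    from airflow import DAG
    from airflow.providers.http.operators.http import SimpleHttpOperator

    with DAG(
        dag_id="spark_via_livy_example",
        start_date=datetime(2018, 11, 16),
        schedule="@daily",  # on Airflow older than 2.4, use schedule_interval instead
        catchup=False,
    ) as dag:
        SimpleHttpOperator(
            task_id="submit_livy_batch",
            http_conn_id="livy_http",  # hypothetical connection to the Livy server
            endpoint="batches",
            method="POST",
            headers={"Content-Type": "application/json"},
            data=livy_batch_payload("hdfs:///jobs/etl_job.py", ["--env", "prod"], "etl_job"),
            response_check=batch_accepted,
        )
    return dag
```

Note that this task only submits the batch and checks that Livy accepted it. Waiting for the Spark job to finish would need a further step that polls the batch state.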

I personally prefer SSHOperator :)

answered Sep 30 '22 02:09

kaxil

Tags:

apache-spark

pyspark

airflow
