
Airflow SparkSubmitOperator - How to spark-submit in another server

Tags: apache-spark, hadoop, airflow

I am new to Airflow and Spark and I am struggling with the SparkSubmitOperator.

Our airflow scheduler and our hadoop cluster are not set up on the same machine (first question: is it a good practice?).

We have many automatic procedures that need to call pyspark scripts. Those pyspark scripts are stored in the hadoop cluster (10.70.1.35). The airflow dags are stored in the airflow machine (10.70.1.22).

Currently, when we want to spark-submit a pyspark script with airflow, we use a simple BashOperator as follows:

cmd = "ssh hadoop@10.70.1.35 spark-submit \
   --master yarn \
   --deploy-mode cluster \
   --executor-memory 2g \
   --executor-cores 2 \
   /home/hadoop/pyspark_script/script.py"
t = BashOperator(task_id='Spark_datamodel',bash_command=cmd,dag=dag)

It works perfectly fine. But we would like to start using SparkSubmitOperator to spark submit our pyspark scripts.

I tried this:

from airflow import DAG
from datetime import timedelta, datetime
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator
from airflow.operators.bash_operator import BashOperator
from airflow.models import Variable

dag = DAG('SPARK_SUBMIT_TEST', start_date=datetime(2018,12,10), schedule_interval='@daily')


sleep = BashOperator(task_id='sleep', bash_command='sleep 10',dag=dag)

_config ={'application':'hadoop@10.70.1.35:/home/hadoop/pyspark_script/test_spark_submit.py',
    'master' : 'yarn',
    'deploy-mode' : 'cluster',
    'executor_cores': 1,
    'EXECUTORS_MEM': '2G'
}

spark_submit_operator = SparkSubmitOperator(
    task_id='spark_submit_job',
    dag=dag,
    **_config)

sleep.set_downstream(spark_submit_operator)

The syntax should be ok as the dag does not show up as broken. But when it runs it gives me the following error:

[2018-12-14 03:26:42,600] {logging_mixin.py:95} INFO - [2018-12-14 03:26:42,600] {base_hook.py:83} INFO - Using connection to: yarn
[2018-12-14 03:26:42,974] {logging_mixin.py:95} INFO - [2018-12-14 03:26:42,973] {spark_submit_hook.py:283} INFO - Spark-Submit cmd: ['spark-submit', '--master', 'yarn', '--executor-cores', '1', '--name', 'airflow-spark', '--queue', 'root.default', 'hadoop@10.70.1.35:/home/hadoop/pyspark_script/test_spark_submit.py']
[2018-12-14 03:26:42,977] {models.py:1760} ERROR - [Errno 2] No such file or directory: 'spark-submit'
Traceback (most recent call last):
  File "/home/dataetl/anaconda3/lib/python3.6/site-packages/airflow/models.py", line 1659, in _run_raw_task
    result = task_copy.execute(context=context)
  File "/home/dataetl/anaconda3/lib/python3.6/site-packages/airflow/contrib/operators/spark_submit_operator.py", line 168, in execute
    self._hook.submit(self._application)
  File "/home/dataetl/anaconda3/lib/python3.6/site-packages/airflow/contrib/hooks/spark_submit_hook.py", line 330, in submit
    **kwargs)
  File "/home/dataetl/anaconda3/lib/python3.6/subprocess.py", line 707, in __init__
    restore_signals, start_new_session)
  File "/home/dataetl/anaconda3/lib/python3.6/subprocess.py", line 1326, in _execute_child
    raise child_exception_type(errno_num, err_msg)
FileNotFoundError: [Errno 2] No such file or directory: 'spark-submit'

Here are my questions:

1. Should I install spark hadoop on my airflow machine? I'm asking because in this topic I read that I need to copy hdfs-site.xml and hive-site.xml. But as you can imagine, I have neither /etc/hadoop/ nor /etc/hive/ directories on my airflow machine.
2. a) If no, where exactly should I copy hdfs-site.xml and hive-site.xml on my airflow machine?
3. b) If yes, does it mean that I need to configure my airflow machine as a client? A kind of edge node that does not participate in jobs but can be used to submit actions?
4. Then, will I be able to spark-submit from my airflow machine? If yes, then I don't need to create a connection on Airflow like I do for a mysql database for example, right?
5. Oh and the cherry on the cake: will I be able to store my pyspark scripts on my airflow machine and spark-submit them from this same airflow machine? It would be amazing!

Any comment would be very useful, even if you're not able to answer all my questions...

Thanks in advance anyway! :)


asked Dec 14 '18 04:12

V. Foy

1 Answer

To answer your first question, yes it is a good practice.

For how you can use SparkSubmitOperator, please refer to my answer on https://stackoverflow.com/a/53344713/5691525
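As a rough illustration of how the flags from the original ssh command might map onto the operator, here is a minimal sketch. It assumes Airflow 1.10 with the contrib operators (the import path used in the question), a connection named `spark_default`, and a hypothetical script location on the Airflow machine. `SPARK_TASK_KWARGS` and `make_spark_task` are illustrative names, not part of Airflow. As far as I know, master and deploy mode are read from the connection rather than passed as keyword arguments, and the memory argument is named `executor_memory`, which would explain why those settings are missing from the logged command.

```python
# Keyword arguments for SparkSubmitOperator, mirroring the flags of the
# original ssh + spark-submit command. Master and deploy mode are left out
# on purpose: the assumption here is that they come from the connection.
SPARK_TASK_KWARGS = {
    "task_id": "spark_submit_job",
    "conn_id": "spark_default",  # assumed connection name
    # Hypothetical path: the script must be readable from the Airflow machine.
    "application": "/home/airflow/pyspark_script/test_spark_submit.py",
    "executor_cores": 2,
    "executor_memory": "2g",
    "name": "airflow-spark",
}


def make_spark_task(dag):
    """Build the task for the given DAG (requires Airflow 1.10 contrib)."""
    # Imported here so the kwargs above can be inspected without Airflow.
    from airflow.contrib.operators.spark_submit_operator import (
        SparkSubmitOperator,
    )

    return SparkSubmitOperator(dag=dag, **SPARK_TASK_KWARGS)
```

In a DAG file this would be used as `spark_submit_operator = make_spark_task(dag)`, followed by the same `sleep.set_downstream(spark_submit_operator)` call as in the question.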

1. Yes, you need the Spark binaries on the Airflow machine.
2. -
3. Yes.
4. No. You still need a connection to tell Airflow where you have installed your Spark binaries. Similar to https://stackoverflow.com/a/50541640/5691525
5. Should work.
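Points 1 and 4 can be checked with a small script run on the Airflow machine. This is only a sketch under stated assumptions: `find_spark_submit` and `connection_extra` are hypothetical helper names, `/opt/spark` is a placeholder install path, and the extra keys (`spark-home`, `deploy-mode`, `queue`) are the ones the Airflow 1.10 contrib Spark hook reads as far as I recall, so verify them against your installed version.

```python
import json
import shutil


def find_spark_submit(binary="spark-submit"):
    """Return the full path of the binary if it is on PATH, else None."""
    return shutil.which(binary)


def connection_extra(spark_home, deploy_mode="cluster", queue="root.default"):
    """Render JSON for the Extra field of the Airflow Spark connection."""
    extra = {
        "spark-home": spark_home,    # assumed key: where Spark is installed
        "deploy-mode": deploy_mode,  # assumed key: cluster or client
        "queue": queue,              # queue name seen in the question's log
    }
    return json.dumps(extra, sort_keys=True)


if __name__ == "__main__":
    path = find_spark_submit()
    print("spark-submit:", path if path else "not found on PATH")
    # /opt/spark is a placeholder, not a path taken from the question.
    print(connection_extra("/opt/spark"))
```

If the first line reports that `spark-submit` is not on PATH, the operator will likely fail with the same `FileNotFoundError` shown in the question, unless the connection points it at the Spark install directory.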

answered Oct 27 '22 01:10

kaxil
