I am new to Airflow and Spark and I am struggling with the SparkSubmitOperator.
Our Airflow scheduler and our Hadoop cluster are not set up on the same machine (first question: is this good practice?).
We have many automated procedures that need to call PySpark scripts. Those scripts are stored on the Hadoop cluster (10.70.1.35). The Airflow DAGs are stored on the Airflow machine (10.70.1.22).
Currently, when we want to spark-submit a pyspark script with airflow, we use a simple BashOperator as follows:
cmd = "ssh [email protected] spark-submit \
--master yarn \
--deploy-mode cluster \
--executor-memory 2g \
--executor-cores 2 \
/home/hadoop/pyspark_script/script.py"
t = BashOperator(task_id='Spark_datamodel',bash_command=cmd,dag=dag)
It works perfectly fine. But we would like to start using SparkSubmitOperator to spark submit our pyspark scripts.
I tried this:
from airflow import DAG
from datetime import timedelta, datetime
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator
from airflow.operators.bash_operator import BashOperator
from airflow.models import Variable
dag = DAG('SPARK_SUBMIT_TEST',start_date=datetime(2018,12,10),
schedule_interval='@daily')
sleep = BashOperator(task_id='sleep', bash_command='sleep 10',dag=dag)
_config ={'application':'[email protected]:/home/hadoop/pyspark_script/test_spark_submit.py',
'master' : 'yarn',
'deploy-mode' : 'cluster',
'executor_cores': 1,
'EXECUTORS_MEM': '2G'
}
spark_submit_operator = SparkSubmitOperator(
task_id='spark_submit_job',
dag=dag,
**_config)
sleep.set_downstream(spark_submit_operator)
The syntax should be ok as the dag does not show up as broken. But when it runs it gives me the following error:
[2018-12-14 03:26:42,600] {logging_mixin.py:95} INFO - [2018-12-14 03:26:42,600] {base_hook.py:83} INFO - Using connection to: yarn
[2018-12-14 03:26:42,974] {logging_mixin.py:95} INFO - [2018-12-14 03:26:42,973] {spark_submit_hook.py:283} INFO - Spark-Submit cmd: ['spark-submit', '--master', 'yarn', '--executor-cores', '1', '--name', 'airflow-spark', '--queue', 'root.default', '[email protected]:/home/hadoop/pyspark_script/test_spark_submit.py']
[2018-12-14 03:26:42,977] {models.py:1760} ERROR - [Errno 2] No such file or directory: 'spark-submit'
Traceback (most recent call last):
  File "/home/dataetl/anaconda3/lib/python3.6/site-packages/airflow/models.py", line 1659, in _run_raw_task
    result = task_copy.execute(context=context)
  File "/home/dataetl/anaconda3/lib/python3.6/site-packages/airflow/contrib/operators/spark_submit_operator.py", line 168, in execute
    self._hook.submit(self._application)
  File "/home/dataetl/anaconda3/lib/python3.6/site-packages/airflow/contrib/hooks/spark_submit_hook.py", line 330, in submit
    **kwargs)
  File "/home/dataetl/anaconda3/lib/python3.6/subprocess.py", line 707, in __init__
    restore_signals, start_new_session)
  File "/home/dataetl/anaconda3/lib/python3.6/subprocess.py", line 1326, in _execute_child
    raise child_exception_type(errno_num, err_msg)
FileNotFoundError: [Errno 2] No such file or directory: 'spark-submit'
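The last frames of the traceback are in subprocess, which suggests the Airflow worker could not find an executable named spark-submit on its PATH. A minimal sketch of that check using only the standard library follows; the helper names find_binary and can_launch are illustrative and not part of Airflow:

```python
import shutil
import subprocess


def find_binary(name="spark-submit"):
    """Return the absolute path of `name` if it is on PATH, else None."""
    return shutil.which(name)


def can_launch(argv):
    """Try to start argv as a child process, the way a hook would.

    Returns False when the executable cannot be found, which is the
    condition behind "[Errno 2] No such file or directory".
    """
    try:
        proc = subprocess.Popen(
            argv, stdout=subprocess.PIPE, stderr=subprocess.STDOUT
        )
    except FileNotFoundError:
        return False
    proc.communicate()  # drain output and wait for the child to exit
    return True


if __name__ == "__main__":
    path = find_binary("spark-submit")
    if path is None:
        print("spark-submit is not on PATH for this user")
    else:
        print("spark-submit found at", path)
```

Running this as the same user that runs the Airflow worker shows whether a Spark client is visible to it at all.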
Here are my questions:
Should I install Spark and Hadoop on my Airflow machine? I'm asking because in this topic I read that I need to copy hdfs-site.xml and hive-site.xml. But as you can imagine, I have neither /etc/hadoop/ nor /etc/hive/ directories on my Airflow machine.
a) If no, where exactly should I copy hdfs-site.xml and hive-site.xml on my Airflow machine?
b) If yes, does it mean that I need to configure my airflow machine as a client? A kind of edge node that does not participate in jobs but can be used to submit actions?
Then, will I be able to spark-submit from my Airflow machine? If yes, then I don't need to create a connection in Airflow like I do for a MySQL database, for example, right?
Oh, and the cherry on the cake: will I be able to store my PySpark scripts on my Airflow machine and spark-submit them from this same machine? It would be amazing!
Any comment would be very useful, even if you're not able to answer all my questions...
Thanks in advance anyway! :)
No. The spark-submit parameters num-executors, executor-cores and executor-memory have no effect in local mode, because they configure executors on a cluster rather than on a single machine. They only apply when the job is deployed to a cluster, in client or cluster mode.
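To illustrate the point above, here is a small sketch that builds a spark-submit argument list and only adds the executor flags when the master is not local. The function build_spark_submit is a hypothetical helper written for this example; the flags themselves are the ones already used in the question:

```python
def build_spark_submit(app, master="yarn", deploy_mode="cluster",
                       num_executors=None, executor_cores=None,
                       executor_memory=None):
    """Return a spark-submit argv list for the given application path."""
    argv = ["spark-submit", "--master", master]
    # In local mode there are no separate executors to size, so the
    # deploy mode and executor flags are left out entirely.
    if not master.startswith("local"):
        argv += ["--deploy-mode", deploy_mode]
        if num_executors is not None:
            argv += ["--num-executors", str(num_executors)]
        if executor_cores is not None:
            argv += ["--executor-cores", str(executor_cores)]
        if executor_memory is not None:
            argv += ["--executor-memory", str(executor_memory)]
    argv.append(app)
    return argv


if __name__ == "__main__":
    print(build_spark_submit("/home/hadoop/pyspark_script/script.py",
                             executor_cores=2, executor_memory="2g"))
```

With the defaults, this reproduces the flags of the BashOperator command from the question.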
To answer your first question, yes it is a good practice.
For how you can use SparkSubmitOperator, please refer to my answer at https://stackoverflow.com/a/53344713/5691525
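As a side note on the _config dictionary in the question: the logged command contains --executor-cores but neither --deploy-mode nor --executor-memory, which suggests those keys were not recognised. The sketch below compares the dictionary against a subset of the operator's keyword arguments. It assumes Airflow 1.10's contrib SparkSubmitOperator, and the set is written from memory and is not exhaustive, so check it against the installed version:

```python
# Subset of SparkSubmitOperator keyword arguments (assumption: Airflow 1.10
# contrib operator; this list is illustrative, not authoritative).
KNOWN_OPERATOR_KWARGS = {
    "application", "conf", "conn_id", "files", "py_files", "jars",
    "name", "num_executors", "executor_cores", "executor_memory",
    "driver_memory", "application_args", "verbose",
}


def unknown_keys(config):
    """Return the keys of config that are not in the known-kwargs subset."""
    return sorted(key for key in config if key not in KNOWN_OPERATOR_KWARGS)


# The dictionary from the question, with the application path shortened.
question_config = {
    "application": "/home/hadoop/pyspark_script/test_spark_submit.py",
    "master": "yarn",
    "deploy-mode": "cluster",
    "executor_cores": 1,
    "EXECUTORS_MEM": "2G",
}

if __name__ == "__main__":
    print(unknown_keys(question_config))
```

As far as I know, in that Airflow version the master comes from the host of the connection named by conn_id and the deploy mode from a "deploy-mode" entry in the connection's extra field, while executor memory is passed as executor_memory.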