I set up a few DAGs, each of which eventually ends with a spark-submit command to a Spark cluster. I'm using cluster mode, if that makes a difference. My code works, but I realized that if the Spark job were to fail, I wouldn't necessarily know from within the Airflow UI. Because the job is submitted in cluster mode, spark-submit hands it off to the cluster and returns right away, so Airflow has no knowledge of the Spark job's outcome.
How can I address this issue?
Airflow (from version 1.8) has:

- SparkSqlOperator - https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/operators/spark_sql_operator.py
- SparkSqlHook - https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/hooks/spark_sql_hook.py
- SparkSubmitOperator - https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/operators/spark_submit_operator.py
- SparkSubmitHook - https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/hooks/spark_submit_hook.py
If you use these, the Airflow task will fail when the Spark job fails. If you are on Spark 1.x, you might have to change the logging part of spark_submit_hook to get real-time logs, because spark-submit writes even errors to stdout in some 1.x versions (I had to make changes for 1.6.1).
Also note that there have been many improvements to SparkSubmitOperator since the last stable release.
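As a rough sketch of what this looks like (the connection id, jar path, and class name below are placeholders, not anything from your setup), a task using SparkSubmitOperator could be defined like this; the hook runs spark-submit as a child process and marks the Airflow task failed if it exits non-zero:

```
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator

dag = DAG(
    dag_id='spark_pipeline',
    start_date=datetime(2017, 1, 1),
    schedule_interval='@daily',
)

# SparkSubmitHook launches spark-submit and watches its output, so a failed
# submission or job surfaces as a failed Airflow task in the UI.
submit_job = SparkSubmitOperator(
    task_id='submit_spark_job',
    application='/path/to/app.jar',      # or a .py file for PySpark
    java_class='com.example.MainClass',  # placeholder; omit for PySpark
    conn_id='spark_default',             # Spark connection configured in Airflow
    dag=dag,
)
```

Deploy mode, master, and similar settings usually come from the Spark connection's extras rather than the operator itself, so check the connection configuration for your Airflow version.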
You can also consider using client mode, since the client process does not terminate until the Spark job is complete, so the Airflow executor can pick up the exit code.
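If you want to keep a plain spark-submit command, a minimal sketch of that idea is to run it in client mode through a BashOperator (master URL, class, and jar path below are placeholders); the command blocks until the driver finishes, and a non-zero exit code fails the task:

```
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id='spark_client_mode',
    start_date=datetime(2017, 1, 1),
    schedule_interval='@daily',
)

# In client mode the driver runs inside this spark-submit process, so it only
# returns when the job finishes; a failed job gives a non-zero exit code,
# which the BashOperator turns into a failed Airflow task.
submit_client = BashOperator(
    task_id='spark_submit_client',
    bash_command=(
        'spark-submit '
        '--master yarn '
        '--deploy-mode client '
        '--class com.example.MainClass '
        '/path/to/app.jar'
    ),
    dag=dag,
)
```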
Otherwise you may need to use a job server. Check out https://github.com/spark-jobserver/spark-jobserver