I set up a few DAGs, each of which eventually ends with a spark-submit command to a Spark cluster. I'm using cluster mode, if that makes a difference. My code works, but I realized that if the Spark job were to fail, I wouldn't necessarily know from within the Airflow UI. Because the job is submitted in cluster mode, spark-submit hands it off to the cluster and returns right away, so Airflow has no knowledge of the Spark job's outcome.
How can I address this issue?
Airflow (from version 1.8) has:

- SparkSqlOperator - https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/operators/spark_sql_operator.py
- SparkSqlHook - https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/hooks/spark_sql_hook.py
- SparkSubmitOperator - https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/operators/spark_submit_operator.py
- SparkSubmitHook - https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/hooks/spark_submit_hook.py
If you use these, the Airflow task will fail when the Spark job fails. If you are on Spark 1.x, you might have to change the logging part of spark_submit_hook to get real-time logs, because spark-submit writes even errors to stdout in some 1.x versions (I had to make changes for 1.6.1).
Also note that there have been many improvements to SparkSubmitOperator since the last stable release.
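As a rough sketch of what this looks like (the connection id, jar path, and class name below are placeholders, not anything from your setup), a task using SparkSubmitOperator could be defined like this; the hook runs spark-submit as a child process and marks the Airflow task failed if it exits non-zero:

```
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator

dag = DAG(
    dag_id='spark_pipeline',
    start_date=datetime(2017, 1, 1),
    schedule_interval='@daily',
)

# SparkSubmitHook launches spark-submit and watches its output, so a failed
# submission or job surfaces as a failed Airflow task in the UI.
submit_job = SparkSubmitOperator(
    task_id='submit_spark_job',
    application='/path/to/app.jar',      # or a .py file for PySpark
    java_class='com.example.MainClass',  # placeholder; omit for PySpark
    conn_id='spark_default',             # Spark connection configured in Airflow
    dag=dag,
)
```

Deploy mode, master, and similar settings usually come from the Spark connection's extras rather than the operator itself, so check the connection configuration for your Airflow version.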
You can also consider using client mode, since the client process does not terminate until the Spark job is complete, so the Airflow executor can pick up the exit code.
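If you want to keep a plain spark-submit command, a minimal sketch of that idea is to run it in client mode through a BashOperator (master URL, class, and jar path below are placeholders); the command blocks until the driver finishes, and a non-zero exit code fails the task:

```
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id='spark_client_mode',
    start_date=datetime(2017, 1, 1),
    schedule_interval='@daily',
)

# In client mode the driver runs inside this spark-submit process, so it only
# returns when the job finishes; a failed job gives a non-zero exit code,
# which the BashOperator turns into a failed Airflow task.
submit_client = BashOperator(
    task_id='spark_submit_client',
    bash_command=(
        'spark-submit '
        '--master yarn '
        '--deploy-mode client '
        '--class com.example.MainClass '
        '/path/to/app.jar'
    ),
    dag=dag,
)
```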
Otherwise you may need to use a job server. Check out https://github.com/spark-jobserver/spark-jobserver