Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

airflow health check

The airflow I'm using, sometimes the pipelines wait for a long time to be scheduled. There have also been instances where a job was running for too long (presumably taking up resources of other jobs)

I'm trying to work out how to programatically identify the health of the scheduler and potentially monitor those in the future without any additional frameworks. I started to have a look at the metadata database tables. All I can think of now is to see start_date and end_date from dag_run, and duration of the tasks. What are the other metrics that I should be looking at? Many thanks for your help.

like image 506
user4046073 Avatar asked Dec 17 '25 21:12

user4046073


1 Answers

There is no need to go "deep" inside the database.

Airflow provide you with metrics that you can utilize for the very purpose: https://airflow.apache.org/docs/apache-airflow/stable/logging-monitoring/metrics.html

If you scroll down, you will see all the useful metrics and some of them are precisely what you are looking for (especially Timers).

This can be done with the usual metrics integration. Airflow publishes the metrics via statsd, and Airflow Official Helm Chart (https://airflow.apache.org/docs/helm-chart/stable/index.html) even exposes those metrics for Prometheus via statsd exporter.

Regarding the spark job - yeah - current implementation of spark submit hook/operator is implemented in "active poll" mode. The "worker" process of airflow polls the status of the job. But Airlfow can run multiple worker jobs in parallel. Also if you want, you can implement your own task which will behave differently.

In "classic" Airflow you'd need to implement a Submit Operator (to submit the job) and "poke_reschedule" sensor (to wait for the job to complete) and implement your DAG in the way that sensort task will be triggered after the operator. The "Poke reschedule" mode works in the way that the sensor is only taking the worker slot for the time of "polling" and then it frees the slot for some time (until it checks again).

As of Airflow 2.2 you can also write a Deferrable Operator (https://airflow.apache.org/docs/apache-airflow/stable/concepts/deferring.html?highlight=deferrable) where you could write single Operator - doing submision first, and then deferring the status check - all in one operator. Defferrable operators are efficiently handling (using async.io) potentially multiple thousands of waiting/deferred operators without taking slots or excessive resources.

Update: If you really cannot use statsd (helm is not needed, statsd is enough) you should never use DB to get information about the DAGS. Use Stable Airflow REST API instead: https://airflow.apache.org/docs/apache-airflow/stable/stable-rest-api-ref.html

like image 157
Jarek Potiuk Avatar answered Dec 20 '25 21:12

Jarek Potiuk



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!