
Airflow ignores resource pool flag when backfilling

Tags:

airflow

Airflow Backfill Sample

Command:

python dag.py backfill -i -t task1 --pool backfill -s "2016-05-29 03:00:00" -e "2016-06-07 00:00:00"

All the tasks get queued and all start running. The pool's max capacity is essentially ignored.

J.Fratzke asked Dec 16 '16 03:12


2 Answers

As far as I know, pool oversubscription is a known issue in 1.7.1.3 (the latest stable release). Further, the Airflow backfill job runner doesn't respect pool constraints; only the scheduler does, and the scheduler doesn't handle backfills. I think both are supposed to change in the next version, but I'm not sure.
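
For anyone unsure what "respecting pool constraints" would look like, here is a toy model in plain Python. It is not Airflow code and uses no Airflow APIs; POOL_SLOTS, run_task and the sleep duration are made-up names and values for illustration. A pool is modelled as a fixed number of slots, and a task only runs while it holds one.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

POOL_SLOTS = 2  # hypothetical pool size, chosen only for this sketch

slots = threading.BoundedSemaphore(POOL_SLOTS)  # one permit per pool slot
lock = threading.Lock()
running = 0  # tasks currently holding a slot
peak = 0     # highest concurrency observed


def run_task(task_id):
    """Pretend to run a task, but only while holding a pool slot."""
    global running, peak
    with slots:  # blocks until a slot is free
        with lock:
            running += 1
            peak = max(peak, running)
        time.sleep(0.01)  # stand-in for real work
        with lock:
            running -= 1
    return task_id


# Eight tasks are submitted at once, as a backfill would queue them.
with ThreadPoolExecutor(max_workers=8) as executor:
    results = list(executor.map(run_task, range(8)))

print("peak concurrency:", peak)  # never exceeds POOL_SLOTS
```

The behaviour described in the question is what you would see if the semaphore were missing: all eight tasks would start together regardless of the slot count.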

Vineet Goel answered Sep 29 '22 10:09


Under the current release, 1.7.1.3, backfilling is, in my experience, almost always a bad idea. The scheduler can end up fighting with the backfill job, the backfilled DAG can enter odd states, and the whole thing can generally be left in a smoking ruin.

Generally, I've had more success by making sure my jobs distribute well across workers and finish in a reasonable time, then trusting the scheduler and the task start_date to carry the task through to completion.

This approach does end up with some pretty horrible oversubscription of the number of DAG runs, and the scheduler tends to choke when it is past the configuration limit. The solution: bump the configuration limit for DAG runs temporarily. The scheduler and executor will tend to work reasonably well together to make sure you don't actually end up running too many jobs at the same time.
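
To get a feel for how far the limit might need to be bumped, a rough count of the runs a date range implies can help. The sketch below is plain Python, not Airflow code. It uses the start and end from the question's command, but the daily and hourly intervals are assumptions, since the question doesn't state the DAG's schedule, and count_runs is a hypothetical helper. The limit I have in mind is presumably max_active_runs_per_dag in airflow.cfg; check the name against your own config.

```python
from datetime import datetime, timedelta


def count_runs(start, end, interval):
    """Count execution dates from start to end (inclusive), stepping by interval."""
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    count = 0
    current = start
    while current <= end:
        count += 1
        current += interval
    return count


# Date range taken from the backfill command in the question.
start = datetime(2016, 5, 29, 3, 0, 0)
end = datetime(2016, 6, 7, 0, 0, 0)

# The schedule intervals are assumptions for illustration only.
daily_runs = count_runs(start, end, timedelta(days=1))
hourly_runs = count_runs(start, end, timedelta(hours=1))

print("daily schedule:", daily_runs)    # 9
print("hourly schedule:", hourly_runs)  # 214
```

This is only an estimate; the scheduler's own rules for when a run is created may shift the count by one at either end.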

russellpierce answered Sep 29 '22 11:09