I recently upgraded from v1.7.1.2 to v1.9.0 and after the upgrade I noticed that the CPU usage increased significantly. After doing some digging, I tracked it down to these two scheduler config options: min_file_process_interval (defaults to 0) and max_threads (defaults to 2).
As expected, increasing min_file_process_interval avoids the tight loop and drops CPU usage when the scheduler goes idle. But what I don't understand is why min_file_process_interval affects task execution.
If I set min_file_process_interval to 60s, the scheduler now waits at least 60s between executing each task in my DAG, so if my DAG has 4 sequential tasks it has added roughly 4 minutes to my execution time. For example:
start -> [task1] -> [task2] -> [task3] -> [task4]
            ^          ^          ^          ^
           60s        60s        60s        60s
I have Airflow set up in a test env and a prod env. This is less of an issue in prod (although still concerning), but a big issue in test. After the upgrade the CPU usage is significantly higher, so I either accept the higher CPU usage or reduce it with a higher config value. However, the higher value adds significant time to my test DAGs' execution.
Why does min_file_process_interval affect time between tasks after the DAG has been scheduled? Are there other config options that could solve my issue?
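For reference, the settings in question live in the [scheduler] section of airflow.cfg. The values below are just an illustration of my test-env setup, not a recommendation:

```ini
# airflow.cfg (Airflow 1.9) -- illustrative values only
[scheduler]
# Minimum seconds between re-parses of each DAG file (default 0 in 1.9)
min_file_process_interval = 60
# Number of parallel DAG file-processing threads (default 2)
max_threads = 2
```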
The most likely cause is that there are too many Python files in the dags folder, so the scheduler re-parses DAG files too frequently.
It is recommended to first reduce the number of DAG files visible to the scheduler and workers, and at the same time to set scheduler_heartbeat_sec and max_threads as large as your deployment allows.
Another option you might want to look into is
SCHEDULER_HEARTBEAT_SEC
This setting usually defaults to a very tight interval but can be loosened up a bit. That setting, in combination with
MAX_THREADS
did the trick for us: the dev machines are still fast enough for re-deployment, but without a hot, glowing CPU, which is good.
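Roughly what we ended up with in airflow.cfg, as a sketch (the numbers are examples we settled on for dev boxes; tune them for your own deployment):

```ini
# airflow.cfg -- example dev-machine tuning, not a universal recommendation
[scheduler]
# Loosen the scheduler heartbeat so the loop idles instead of spinning
scheduler_heartbeat_sec = 10
# Keep DAG parsing parallelism modest on small dev machines
max_threads = 2
# Don't re-parse each DAG file more often than this (seconds)
min_file_process_interval = 30
```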